Hapax survival over time

Bill McClain wmcclain at salamander.com
Wed Mar 24 13:57:01 CET 2004

On 23 Mar 2004 23:09:39 -0500
Tom Anderson <tanderso at oac-design.com> wrote:

> I'm not sure a longer period of time is really necessary.  Clearly if
> a token has been seen only once in 20-30 days, it does not play a very
> large role in classifying the vast bulk of your messages.  Therefore,
> it could not possibly hurt to delete it and then score it at robx on
> day 31+.  How strong of an indicator could it be if it is seen so
> infrequently?

I'm becoming aware of a "secret life of spam", normally invisible.
Looking at my data, old hapaxes are still being struck off the wordlist
by ones and twos each day, even months after they were first collected.
But look at this. Here is an extract of the record for hapaxes collected
on Jan 25. Column 1 = date, 2 = hapaxes eliminated that day, 3 = hapaxes
eliminated that day as a percent of the Jan 25 hapaxes still remaining:

    20040301    6   0.4
    20040302    2   0.1
    20040303    5   0.4
    20040304    3   0.2
    20040305   60   4.3
    20040306    9   0.7
    20040307    2   0.2
    20040308    2   0.2
    20040309    4   0.3
    20040310    4   0.3
    20040311   65   5.0
    20040312    2   0.2
    20040313    0   0.0
    20040314    2   0.2

More than a month after being collected, these hapaxes are still being
used in small numbers each day, and then we see large bursts on Mar 5
and 11. So the Jan 25 hapaxes were particularly valuable on those days.
I speculate that some types of spam come around again at long intervals;
why, I don't know.
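
As a rough sketch of how the third column of the extract can be derived
and how the burst days can be picked out: the daily counts below are the
ones shown in the extract, but the starting size of 1400 is an
assumption (the real figure is not given; 1400 merely happens to be
consistent with the percentages shown), and the 2% burst threshold and
the function names are hypothetical.

```python
def survival_table(remaining, daily_eliminated):
    """Return (date, eliminated, percent_of_remaining) rows.

    remaining: hapaxes from the cohort still on the wordlist before
               the first listed day (assumed, not taken from the data).
    daily_eliminated: (date, count) pairs in chronological order.
    """
    rows = []
    for date, eliminated in daily_eliminated:
        pct = 100.0 * eliminated / remaining if remaining else 0.0
        rows.append((date, eliminated, round(pct, 1)))
        remaining -= eliminated  # eliminated tokens leave the cohort
    return rows


def burst_days(rows, threshold_pct=2.0):
    """Dates on which more than threshold_pct of the cohort was used."""
    return [date for date, _, pct in rows if pct > threshold_pct]


# Daily counts copied from the Jan 25 extract.
EXTRACT = [
    ("20040301", 6), ("20040302", 2), ("20040303", 5), ("20040304", 3),
    ("20040305", 60), ("20040306", 9), ("20040307", 2), ("20040308", 2),
    ("20040309", 4), ("20040310", 4), ("20040311", 65), ("20040312", 2),
    ("20040313", 0), ("20040314", 2),
]

ASSUMED_START = 1400  # hypothetical cohort size on Mar 1

rows = survival_table(ASSUMED_START, EXTRACT)
for date, eliminated, pct in rows:
    print(f"{date} {eliminated:4d} {pct:5.1f}")
print("bursts:", burst_days(rows))
```

The denominator shrinks as hapaxes are eliminated, so the same daily
count is a slightly larger percentage later in the month.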

Still, this doesn't answer the question "how valuable are the old
hapaxes?" They are providing some value, but perhaps the messages that
use them were already very spammy. 

It would be interesting to know: given a hapax "X", scored as spam
because of the overall score of its message, how likely is it to stay
spammy when it is seen again, or instead to drift toward neutrality, or
even to cross into ham territory? 
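
One way the question might be approached is to record, for each token,
the class of every message it appears in, and then tally where the
former spam hapaxes end up. This is only a sketch: the raw spam fraction
used here is a simplification and not bogofilter's actual token score,
and the function names and the 0.2 neutral band are hypothetical.

```python
from collections import Counter


def classify_drift(first_label, later_labels, band=0.2):
    """Say where a former hapax ends up after further sightings.

    first_label: "spam" or "ham", class of the message it was first seen in.
    later_labels: classes of the later messages containing the token.
    band: half-width of the neutral zone around 0.5 (hypothetical value).
    """
    if not later_labels:
        return "unseen"  # still a hapax, nothing to measure
    labels = [first_label] + list(later_labels)
    spam_fraction = labels.count("spam") / len(labels)
    if spam_fraction >= 0.5 + band:
        return "spammy"
    if spam_fraction <= 0.5 - band:
        return "hammy"
    return "neutral"


def drift_summary(history):
    """Tally outcomes for tokens whose first sighting was in spam.

    history: dict mapping token -> list of message classes, in the
             order the token was seen.
    """
    outcome = Counter()
    for labels in history.values():
        if labels and labels[0] == "spam":
            outcome[classify_drift(labels[0], labels[1:])] += 1
    return outcome
```

Called on a history built from the training log, drift_summary() would
give counts of "spammy", "neutral", "hammy" and "unseen" tokens, which
is a crude answer to how often a spam hapax stays spammy.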

I spent a small amount of time actually examining the hapaxes that are
eliminated each day, trying to see if the tokens were of a specific
type. They seemed to be of all types.

Sattre Press                                The King in Yellow
http://sattre-press.com/                 by Robert W. Chambers
info at sattre-press.com         http://sattre-press.com/kiy.html

More information about the Bogofilter mailing list