Hapax survival over time

David Relson relson at osagesoftware.com
Wed Mar 24 14:18:05 CET 2004


On Wed, 24 Mar 2004 06:57:01 -0600
Bill McClain wrote:

..[snip]...

> Still, this doesn't answer the question "how valuable are the old
> hapaxes?" They are providing some value, but perhaps the messages that
> use them were already very spammy. 
> 
> It would be interesting to know: given a hapax "X", scored as spam
> because of the overall score of its message, how likely is it to stay
> spammy when it is seen again, or instead to drift toward neutrality,
> or even to cross into ham territory? 
> 
> I spent a small amount of time actually examining the hapaxes that are
> eliminated each day, trying to see if the tokens were of a specific
> type. They seemed to be of all types.

From Oct 2002 through Dec 2002 _all_ my mail was used in training
(through the combined magic of autoupdate and manual handling of unsures
and errors).  Since training changes a token's ham/spam counts and
timestamp, I can look for old hapaxes and know that they haven't been
seen again.  FWIW, I have 

    70,143 hapaxes from 2002,
   260,326 from the first half of 2003, and
   386,290 from the second half of 2003.

Looks like many hapaxes do not reappear.
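
For anyone who wants to try a similar count on their own wordlist, here is
a rough sketch in Python. It is not necessarily how the numbers above were
produced. It assumes a text dump with one token per line in the form
"token spamcount hamcount YYYYMMDD" (the layout I'd expect from
"bogoutil -d"; check your version), and it treats a hapax as any token
whose spam and ham counts add up to 1. The function names and the sample
tokens are made up for illustration.

```python
from collections import Counter


def half_year(yyyymmdd):
    """Map a YYYYMMDD string to a label such as '2003-H1'."""
    year, month = yyyymmdd[:4], int(yyyymmdd[4:6])
    return f"{year}-H{1 if month <= 6 else 2}"


def count_hapaxes(lines):
    """Count tokens seen exactly once, grouped by half-year of last update.

    Assumed line format: token spamcount hamcount YYYYMMDD
    Lines that do not match that shape are skipped.
    """
    counts = Counter()
    for line in lines:
        # Split from the right so the last three fields are always the
        # two counts and the date, whatever the token looks like.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue
        token, spam, ham, date = parts
        if not (spam.isdigit() and ham.isdigit()):
            continue
        if len(date) != 8 or not date.isdigit():
            continue
        # A hapax has been seen exactly once in total.
        if int(spam) + int(ham) == 1:
            counts[half_year(date)] += 1
    return counts


# Made-up sample data, only to show the shape of the input.
sample = [
    "viagra 120 0 20040320",
    "xqzt 1 0 20021115",
    "meeting 0 1 20030402",
    "qwpl 1 0 20030918",
    "zzyx 1 0 20030101",
]
result = count_hapaxes(sample)
for period in sorted(result):
    print(period, result[period])
```

On a real wordlist the dump could be piped in and the lines of sys.stdin
passed to count_hapaxes(). Note that this only works as a "not seen again"
test if every message is used in training, since otherwise the counts and
timestamp are not updated when a token reappears.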



