Hapax survival over time

David Relson relson at osagesoftware.com
Wed Mar 24 14:34:15 CET 2004


On Wed, 24 Mar 2004 14:22:21 +0100
Boris 'pi' Piwinger wrote:

> David Relson wrote:
> 
> > From Oct 2002 through Dec 2002 _all_ my mail was used in training
> > (through the combined magic of autoupdate and manual handling of
> > unsures and errors).  Since training changes a token's ham/spam
> > counts and timestamp, I can look for old hapaxes and know that they
> > haven't been seen again.  FWIW, I have 
> > 
> >     70,143 hapaxes from 2002 and 
> >    260,326 from the first half of 2003
> >    386,290 from the second half of 2003
> > 
> > Looks like many hapaxes do not reappear.
> 
> Is it correct to assume that you did not rebuild your
> database on lexer changes? So that could also include
> hapaxes which will never occur because the lexer now rejects
> them.

Your assumption is correct, and so is your guess.  I know there are
VERP (variable envelope return path) tokens like "list-12-relson" in
my wordlist.
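
For anyone who wants to repeat the count on their own wordlist, here is a
rough sketch of the idea.  It assumes a text dump with one token per line
in the form "token spamcount hamcount yyyymmdd" (the layout I would expect
from a wordlist dump, but check your own output first).  The function names
and the half-year buckets are only for illustration; a token counts as a
hapax when its spam and ham counts add up to exactly 1.

```python
from collections import Counter


def period(yyyymmdd):
    """Map a yyyymmdd stamp to a half-year bucket such as '2003-H1'."""
    year, month = yyyymmdd[:4], int(yyyymmdd[4:6])
    return f"{year}-H1" if month <= 6 else f"{year}-H2"


def count_hapaxes(lines):
    """Count tokens seen exactly once, grouped by the half-year of their timestamp.

    Assumed (not verified) line layout: token spamcount hamcount yyyymmdd
    """
    counts = Counter()
    for line in lines:
        # Split from the right so a token containing spaces stays in one piece.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        token, spam, ham, stamp = parts
        if int(spam) + int(ham) == 1:
            counts[period(stamp)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines, not real wordlist data.
    sample = [
        "list-12-relson 0 1 20021105",
        "common 10 3 20030201",
        "foo 1 0 20030301",
        "bar 0 1 20030915",
    ]
    for bucket, n in sorted(count_hapaxes(sample).items()):
        print(bucket, n)
```

Because training updates a token's timestamp, a hapax that still carries
an old date has not been seen again since then.  Note that this cannot
tell a token that simply never recurred from one the current lexer would
no longer produce.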



