Hapax survival over time

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Mar 24 14:22:21 CET 2004


David Relson wrote:

> From Oct 2002 through Dec 2002 _all_ my mail was used in training
> (through the combined magic of autoupdate and manual handling of unsures
> and errors).  Since training changes a token's ham/spam counts and
> timestamp, I can look for old hapaxes and know that they haven't been
> seen again.  FWIW, I have 
> 
>     70,143 hapaxes from 2002 and 
>    260,326 from the first half of 2003
>    386,290 from the second half of 2003
> 
> Looks like many hapaxes do not reappear.

Is it correct to assume that you did not rebuild your
database after lexer changes? If so, those counts could also
include hapaxes which will never recur because the lexer now
rejects them.
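
For illustration, a rough sketch of how such a count could be made
from a text dump of the wordlist. It assumes each line has the form
"token spam_count ham_count YYYYMMDD"; that layout, the function name
and the `accepts` predicate (a hypothetical stand-in for "the current
lexer would still produce this token") are assumptions, not anything
taken from bogofilter itself.

```python
from collections import Counter


def hapax_counts_by_period(lines, accepts=lambda token: True):
    """Count hapaxes (tokens seen exactly once) per half-year.

    `lines` is an iterable of strings, assumed to look like
    "token spam_count ham_count YYYYMMDD".
    `accepts` is a hypothetical predicate standing in for the current
    lexer; tokens it rejects are left out of the count.
    """
    counts = Counter()
    for line in lines:
        # Split from the right so the three numeric fields are isolated
        # even if the token itself were to contain whitespace.
        parts = line.rsplit(None, 3)
        if len(parts) != 4:
            continue  # blank or malformed line
        token, spam, ham, date = parts
        if not (spam.isdigit() and ham.isdigit()):
            continue
        if len(date) != 8 or not date.isdigit():
            continue
        if int(spam) + int(ham) != 1:
            continue  # seen more than once (or never): not a hapax
        if not accepts(token):
            continue  # the current lexer would never emit this token
        half = "H1" if int(date[4:6]) <= 6 else "H2"
        counts[f"{date[:4]}-{half}"] += 1
    return counts
```

Running it once with the default `accepts` and once with a predicate
that mimics the current lexer would show how many of the old hapaxes
can never come back at all.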

pi



