mailing lists and hapaxes

David Relson relson at osagesoftware.com
Thu Sep 25 02:00:31 CEST 2003


Greetings,

As part of another test, I grepped my wordlist for my userid and was
surprised to find 31,400 tokens containing it.  Checking further, I
found the culprit to be mailing lists that use serially numbered return
addresses, for example:

bogofilter-return-1000-relson 0 1 
bogofilter-return-1001-relson 0 1 
bogofilter-return-1002-relson 0 1 
bogofilter-return-1003-relson 0 1 
...

Of the 31,400 tokens, all except 238 are hapaxes, i.e. tokens that have
appeared exactly once.
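
A minimal sketch of how such a count could be made, assuming a plain-text
dump of the wordlist with one "token count count" line per token, as in
the excerpt above. The function name count_hapaxes and the sample data
are hypothetical, and a real wordlist dump may carry extra columns:

```python
def count_hapaxes(lines, needle):
    """Return (matching, hapaxes) for tokens that contain `needle`.

    Assumes each line looks like "token count count", as in the excerpt.
    A token is treated as a hapax when its two counts sum to exactly 1.
    """
    matching = 0
    hapaxes = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        token = fields[0]
        if needle not in token:
            continue
        try:
            total = int(fields[1]) + int(fields[2])
        except ValueError:
            continue  # counts were not integers; skip the line
        matching += 1
        if total == 1:
            hapaxes += 1
    return matching, hapaxes


# Hypothetical sample data in the format shown above.
sample = [
    "bogofilter-return-1000-relson 0 1",
    "bogofilter-return-1001-relson 0 1",
    "bogofilter-return-1002-relson 0 1",
    "bogofilter-return-1003-relson 0 1",
    "relson 12 340",
]

print(count_hapaxes(sample, "relson"))  # (5, 4)
```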

This could be a reason to _not_ use '-u' (auto-update).  It could also
be a reason to periodically delete hapaxes.
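
As a rough illustration of periodic hapax deletion, the sketch below
filters a plain-text dump and keeps only tokens seen more than once. It
assumes the "token count count" line format from the excerpt above; the
name drop_hapaxes and the sample data are hypothetical, and this is not
bogofilter's own maintenance mechanism:

```python
def drop_hapaxes(lines):
    """Yield the lines whose two counts do not sum to exactly 1.

    Assumes "token count count" lines. Lines that cannot be parsed are
    passed through unchanged rather than silently discarded.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            try:
                total = int(fields[1]) + int(fields[2])
            except ValueError:
                total = None  # not a count line; keep it as-is
            if total == 1:
                continue  # hapax: drop it
        yield line


# Hypothetical sample data in the format shown above.
dump = [
    "bogofilter-return-1000-relson 0 1",
    "relson 12 340",
    "bogofilter-return-1001-relson 0 1",
    "hello 3 2",
]

kept = list(drop_hapaxes(dump))
print(kept)  # ['relson 12 340', 'hello 3 2']
```

Note that a filter like this cannot tell a serial-numbered address from
a genuinely rare word, so it would discard both.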

Has anybody else noticed this phenomenon?  Any thoughts on how best to
deal with it?

David



