mailing lists and hapaxes
David Relson
relson at osagesoftware.com
Thu Sep 25 02:00:31 CEST 2003
Greetings,
As part of another test, I grepped my wordlist for my userid and was
surprised to find 31,400 tokens containing it. Checking further, I
found the culprit to be mailing lists that have serial-numbered return
addresses, for example:
bogofilter-return-1000-relson 0 1
bogofilter-return-1001-relson 0 1
bogofilter-return-1002-relson 0 1
bogofilter-return-1003-relson 0 1
...
Of the 31,400 tokens, all except 238 are hapaxes, i.e. tokens that have
appeared exactly once.
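
For anyone who wants to repeat the count on their own wordlist, here is a rough sketch. It assumes a plain-text dump with one token per line followed by two integer counts, as in the sample above, and it treats a token as a hapax when the two counts sum to 1. The function name count_hapaxes and the sample data are made up for illustration and are not part of bogofilter.

```python
def count_hapaxes(lines, needle):
    """Return (total, hapaxes) for tokens that contain `needle`.

    Assumes each line looks like "token count count", optionally with
    extra trailing columns, which are ignored.
    """
    total = 0
    hapaxes = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 3 or needle not in parts[0]:
            continue  # skip malformed lines and unrelated tokens
        total += 1
        # A hapax has been seen exactly once across both counts.
        if int(parts[1]) + int(parts[2]) == 1:
            hapaxes += 1
    return total, hapaxes


# Hypothetical sample in the same layout as the lines quoted above.
sample = [
    "bogofilter-return-1000-relson 0 1",
    "bogofilter-return-1001-relson 0 1",
    "bogofilter-return-1002-relson 0 1",
    "bogofilter-return-1003-relson 0 1",
    "relson 5 12",
    "unrelated 0 1",
]

total, hapaxes = count_hapaxes(sample, "relson")
print(total, hapaxes)
```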
This could be a reason to _not_ use '-u' (auto-update). It could also
be a reason to periodically delete hapaxes.
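
A periodic cleanup could look something like the sketch below. It only filters lines of a text dump in the layout shown above (token, then two integer counts) and drops every token whose counts sum to 1. Getting the dump out of the wordlist and loading the filtered result back in is left out, and the name drop_hapaxes is hypothetical.

```python
def drop_hapaxes(lines):
    """Return the dump lines whose token has been seen more than once.

    Assumes each line looks like "token count count"; lines that do not
    fit that shape are kept untouched rather than guessed at.
    """
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and int(parts[1]) + int(parts[2]) == 1:
            continue  # hapax: seen exactly once, so drop it
        kept.append(line)
    return kept


# Hypothetical dump lines for illustration only.
dump = [
    "bogofilter-return-1000-relson 0 1",
    "bogofilter-return-1001-relson 0 1",
    "relson 5 12",
    "hello 2 0",
]

for line in drop_hapaxes(dump):
    print(line)
```

Note that this removes all hapaxes, not only the serial-numbered return addresses, so a token that has only just shown up for the first time is lost as well.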
Has anybody else noticed this phenomenon? Any thoughts on how best to
deal with it?
David