mailing lists and hapaxes
michael at optusnet.com.au
Thu Sep 25 02:20:17 CEST 2003
David Relson <relson at osagesoftware.com> writes:
> Greetings,
>
> As part of another test, I grepped my wordlist for my userid and was
> surprised to find 31,400 tokens containing it. Checking further, I
[...]
> This could be a reason to _not_ use '-u' (auto-update). It could also
> be a reason to periodically delete hapaxes.
>
> Has anybody else noticed this phenomena? Any thoughts on how best to
> deal with it?
My best idea on this will grow the database. :(
We could add a 'time last written' field to each token (a 4-byte
increase per token!) and delete hapaxes whose timestamp is older
than X (1 month?).
My thinking here is that randomly deleting hapaxes is dangerous, because
you don't know if they're about to turn into real tokens. But if
one has remained a hapax for a month, it's pretty unlikely you'll see
it again, so you can fairly safely kill it.
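To make the idea concrete, here is a rough sketch of the pruning pass. It assumes each token record holds a count and a 'time last written' in Unix seconds, and that a hapax is a token with a count of exactly 1. The dict layout and the name prune_stale_hapaxes are made up for illustration; this is not bogofilter's actual wordlist format or code.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def prune_stale_hapaxes(wordlist, now=None, max_age_days=30):
    """Delete hapaxes whose timestamp is older than max_age_days.

    wordlist maps token -> (count, last_written), where last_written is
    Unix seconds. This layout is hypothetical, not bogofilter's format.
    Returns the list of deleted tokens.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * SECONDS_PER_DAY

    # Collect first, delete afterwards: a dict must not change size
    # while it is being iterated.
    stale = [
        token
        for token, (count, last_written) in wordlist.items()
        if count == 1 and last_written < cutoff
    ]
    for token in stale:
        del wordlist[token]
    return stale


# Illustrative data only: one established token, one old hapax,
# one recent hapax that might still turn into a real token.
now = 1_064_448_000
wordlist = {
    "meeting": (12, now - 90 * SECONDS_PER_DAY),
    "xq7zt": (1, now - 45 * SECONDS_PER_DAY),
    "relson": (1, now - 2 * SECONDS_PER_DAY),
}
removed = prune_stale_hapaxes(wordlist, now=now)
```

Only the old hapax is dropped here. The recent one survives because it has not yet had its month to prove itself, and the established token is never touched no matter how old its timestamp is.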
Given that the majority of tokens in the database are normally hapaxes
(where _does_ that term come from? :), the overhead from adding a timestamp
may be outweighed by the gain from shrinking the database.
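A back-of-the-envelope way to check that trade-off, as a sketch only. It assumes a fixed average record size and ignores any per-page or index overhead in the underlying database. The function name and all the numbers in the example call are invented for illustration, not measured from a real wordlist.

```python
def net_bytes_saved(n_tokens, hapax_fraction, stale_fraction,
                    record_bytes, timestamp_bytes=4):
    """Estimate bytes saved by adding a timestamp and pruning old hapaxes.

    hapax_fraction: share of all tokens that are hapaxes.
    stale_fraction: share of those hapaxes old enough to delete.
    record_bytes:   assumed average size of a token record today.
    A negative result means the timestamp costs more than pruning saves.
    """
    before = n_tokens * record_bytes
    deleted = n_tokens * hapax_fraction * stale_fraction
    after = (n_tokens - deleted) * (record_bytes + timestamp_bytes)
    return before - after


# Made-up figures: 1M tokens, 60% hapaxes, half of them older than a month,
# 20 bytes per record before the timestamp is added.
saved = net_bytes_saved(1_000_000, 0.6, 0.5, 20)
```

Under this simple model the change pays for itself once the deleted share of all tokens (hapax_fraction * stale_fraction) exceeds timestamp_bytes / (record_bytes + timestamp_bytes), so small records make the 4 bytes relatively more expensive.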
Comments?
Michael.