mailing lists and hapaxes
David Relson
relson at osagesoftware.com
Thu Sep 25 02:43:14 CEST 2003
On 25 Sep 2003 10:20:17 +1000
michael at optusnet.com.au wrote:
> David Relson <relson at osagesoftware.com> writes:
>
> > Greetings,
> >
> > As part of another test, I grepped my wordlist for my userid and was
> > surprised to find 31,400 tokens containing it. Checking further, I
> [...]
> > This could be a reason to _not_ use '-u' (auto-update). It could
> > also be a reason to periodically delete hapaxes.
> >
> > Has anybody else noticed this phenomenon? Any thoughts on how best
> > to deal with it?
>
> My best idea on this will grow the database. :(
>
> We could add 'time last written' for each token ( a 4 byte per token
> increase! ) and delete hapaxes where the create time is older
> than X ( 1 month? )
>
> My thinking here is that randomly deleting hapaxes is dangerous,
> because you don't know if they're about to turn into real tokens. But
> if they've remained a hapax for a month, it's pretty unlikely you'll
> see another one of them, so you can fairly safely kill it.
>
> Given that the majority of tokens in the database are normally hapaxes
> (where _does_ that term come from? :) , the overhead from adding a
> timestamp may be outweighed by the gain from shrinking the database.
Hi Michael,
Your reasoning sounds good.
By the way, bogofilter already has a timestamp of the form YYYYMMDD (for
human readability). Option "-y 0" can be used to set it to zero, in
which case timestamping is disabled. The maintenance functions can be
used to delete tokens by date, count, and/or length.
My preference would be to keep those tokens out of the wordlist from the
get-go. However, I don't as yet have a way to do that. The answer may
well be as you suggest - delete old hapaxes.
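For what it's worth, here is a rough sketch of the "delete old hapaxes" idea in Python. It is not bogofilter's code: the in-memory dict, the (spam, ham, stamp) tuple layout, and the function names are all made up for illustration. It assumes a hapax is a token whose total count is 1, that the stamp is an integer of the form YYYYMMDD, and that a stamp of 0 means timestamping was disabled, so such tokens are left alone.

```python
from datetime import date, timedelta


def yyyymmdd(d: date) -> int:
    """Encode a date as an integer of the form YYYYMMDD."""
    return d.year * 10000 + d.month * 100 + d.day


def prune_old_hapaxes(wordlist, today, max_age_days=30):
    """Return a copy of wordlist without hapaxes older than max_age_days.

    wordlist maps token -> (spam_count, ham_count, stamp). This layout is
    a hypothetical stand-in, not bogofilter's on-disk format.
    """
    cutoff = yyyymmdd(today - timedelta(days=max_age_days))
    kept = {}
    for token, (spam, ham, stamp) in wordlist.items():
        is_hapax = (spam + ham) == 1
        # stamp == 0 is treated as "no timestamp", so the token's age is
        # unknown and it is kept rather than guessed at.
        if is_hapax and stamp != 0 and stamp < cutoff:
            continue
        kept[token] = (spam, ham, stamp)
    return kept


# Made-up sample data.
wordlist = {
    "viagra": (120, 0, 20030920),      # frequent token, always kept
    "relson-x7f3": (0, 1, 20030801),   # hapax, older than the cutoff
    "newword": (1, 0, 20030924),       # hapax, still recent
    "untimed": (1, 0, 0),              # hapax with timestamping disabled
}
pruned = prune_old_hapaxes(wordlist, date(2003, 9, 25))
```

With the sample data above, only "relson-x7f3" is dropped. A recent hapax survives because it may yet turn into a real token, which is the point of Michael's one-month window.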
David
P.S. Have you looked at the changed parsing in 0.15.4? It should look
familiar to you :-)