mailing lists and hapaxes

michael at optusnet.com.au
Thu Sep 25 02:20:17 CEST 2003


David Relson <relson at osagesoftware.com> writes:

> Greetings,
> 
> As part of another test, I grepped my wordlist for my userid and was
> surprised to find 31,400 tokens containing it.  Checking further, I
[...]
> This could be a reason to _not_ use '-u' (auto-update).  It could also
> be a reason to periodically delete hapaxes.
> 
> Has anybody else noticed this phenomena?  Any thoughts on how best to
> deal with it?

My best idea on this will grow the database. :(

We could add a 'time last written' field to each token (a 4-byte-per-token
increase!) and delete hapaxes whose timestamp is older than X (1 month?).

My thinking here is that randomly deleting hapaxes is dangerous, because
you don't know whether they're about to turn into real tokens. But if
a token has remained a hapax for a month, it's pretty unlikely you'll see
it again, so you can fairly safely kill it.
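
A minimal sketch of the aging rule, in Python rather than bogofilter's C.
Everything here is an assumption for illustration: the wordlist is modelled
as a plain dict mapping token -> (count, last_written), a hapax is any token
with a count of 1, last_written is a Unix timestamp in seconds, and
prune_stale_hapaxes is a made-up name, not a bogofilter function.

```python
import time

SECONDS_PER_DAY = 86400


def prune_stale_hapaxes(wordlist, now=None, max_age_days=30):
    """Delete tokens seen exactly once whose timestamp is too old.

    wordlist: dict mapping token -> (count, last_written), a hypothetical
    stand-in for the real on-disk wordlist.
    Returns the number of tokens removed.
    """
    if now is None:
        now = int(time.time())
    cutoff = now - max_age_days * SECONDS_PER_DAY

    # Collect first, then delete, so the dict is not modified while iterating.
    stale = [tok for tok, (count, stamp) in wordlist.items()
             if count == 1 and stamp < cutoff]
    for tok in stale:
        del wordlist[tok]
    return len(stale)


# Example with a fixed "now" so the outcome does not depend on the clock.
now = 1_064_448_000
wordlist = {
    "viagra":  (57, now - 90 * SECONDS_PER_DAY),  # old, but not a hapax: kept
    "x7qz":    (1,  now - 45 * SECONDS_PER_DAY),  # stale hapax: removed
    "newword": (1,  now - 2 * SECONDS_PER_DAY),   # recent hapax: kept
}
removed = prune_stale_hapaxes(wordlist, now=now)
```

Recent hapaxes survive because they may still turn into real tokens; only
the ones that have sat at a count of 1 past the cutoff are dropped.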

Given that the majority of tokens in the database are normally hapaxes
(where _does_ that term come from? :), the overhead from adding a timestamp
may be outweighed by the gain from shrinking the database.
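
A back-of-the-envelope check on that trade-off, as a rough sketch. It assumes
every token record takes the same S bytes before the change and S + 4 bytes
after, and it ignores any page or index overhead in the underlying database.
If a fraction f of the tokens gets pruned, the database shrinks when
(1 - f) * (S + 4) < S, i.e. when f > 4 / (S + 4). The 20-byte record size
below is an illustrative guess, not a measured figure.

```python
def breakeven_fraction(record_bytes, stamp_bytes=4):
    """Fraction of tokens that must be pruned before the database shrinks."""
    return stamp_bytes / (record_bytes + stamp_bytes)


def size_after(n_tokens, record_bytes, pruned_fraction, stamp_bytes=4):
    """Approximate size in bytes once timestamps are added and pruning is done."""
    kept = n_tokens * (1 - pruned_fraction)
    return kept * (record_bytes + stamp_bytes)


# Hypothetical numbers: 1000 tokens at 20 bytes each, half of them pruned.
before = 1000 * 20
after = size_after(1000, 20, 0.5)
threshold = breakeven_fraction(20)
```

Under these assumptions the timestamp pays for itself once roughly one token
in six is a hapax old enough to delete.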

Comments?

Michael.



