mailing lists and hapaxes

Thu Sep 25 02:43:14 CEST 2003

On 25 Sep 2003 10:20:17 +1000
michael at optusnet.com.au wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> > Greetings,
> > 
> > As part of another test, I grepped my wordlist for my userid and was
> > surprised to find 31,400 tokens containing it.  Checking further, I
> [...]
> > This could be a reason to _not_ use '-u' (auto-update).  It could
> > also be a reason to periodically delete hapaxes.
> > 
> > Has anybody else noticed this phenomenon?  Any thoughts on how best
> > to deal with it?
> 
> My best idea on this will grow the database. :(
> 
> We could add 'time last written' for each token ( a 4 byte per token
> increase! ) and delete hapaxes where the create time is older
> than X ( 1 month? ) 
> 
> My thinking here is that randomly deleting hapaxes is dangerous,
> because you don't know if they're about to turn into real tokens. But
> if they've remained a hapax for a month, it's pretty unlikely you'll
> see another one of them, so you can fairly safely kill it.
> 
> Given that the majority of tokens in the database are normally hapaxes
> (where _does_ that term come from? :) , the overhead from adding a
> timestamp may be outweighed by the gain from shrinking the database.

Hi Michael,

Your reasoning sounds good.  

By the way, bogofilter already has a timestamp of the form YYYYMMDD (for
human readability).  Option "-y 0" can be used to set it to zero, in
which case timestamping is disabled.  The maintenance functions can be
used to delete tokens by date, count, and/or length.

My preference would be to keep those tokens out of the wordlist from the
get-go.  However, I don't as yet have a way to do that.  The answer may
well be as you suggest - delete old hapaxes.
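
To make the "delete old hapaxes" idea concrete, here is a rough sketch in
Python.  It is not bogofilter code: the dict layout, the function names,
and the single per-token count are all made up for illustration, and I am
assuming that a zero timestamp (as with "-y 0") means "age unknown", so
such tokens are kept.

```python
from datetime import date, timedelta

def yyyymmdd(d):
    """Encode a date as an integer of the form YYYYMMDD."""
    return d.year * 10000 + d.month * 100 + d.day

def prune_old_hapaxes(wordlist, today, max_age_days=30):
    """Return a copy of wordlist without hapaxes older than max_age_days.

    wordlist is a hypothetical mapping: token -> (count, stamp), where
    stamp is a YYYYMMDD integer, or 0 if timestamping was disabled.
    """
    cutoff = yyyymmdd(today - timedelta(days=max_age_days))
    kept = {}
    for token, (count, stamp) in wordlist.items():
        is_hapax = count == 1
        # Assumption: stamp 0 means the age is unknown, so keep the token.
        if is_hapax and stamp != 0 and stamp < cutoff:
            continue  # seen once and not touched for over max_age_days
        kept[token] = (count, stamp)
    return kept

wordlist = {
    "relson": (5, 20030101),       # old, but not a hapax: kept
    "old-hapax": (1, 20030701),    # hapax older than the cutoff: dropped
    "new-hapax": (1, 20030920),    # recent hapax, may still grow: kept
    "undated": (1, 0),             # no timestamp: kept
}
pruned = prune_old_hapaxes(wordlist, today=date(2003, 9, 25))
```

YYYYMMDD integers sort in date order, so a plain numeric comparison
against the cutoff is enough.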

David

P.S. Have you looked at the changed parsing in 0.15.4???  It should look
familiar to you :-)