mailing lists and hapaxes

michael at optusnet.com.au
Thu Sep 25 03:53:09 CEST 2003


David Relson <relson at osagesoftware.com> writes:
> On 25 Sep 2003 10:20:17 +1000
> michael at optusnet.com.au wrote:
[..]
> > Given that the majority of tokens in the database are normally hapaxes
> > (where _does_ that term come from? :) , the overhead from adding a
> > timestamp may be outweighed by the gain from shrinking the database.
> 
> Hi Michael,
> 
> Your reasoning sounds good.  
> 
> By the way, bogofilter already has a timestamp of form YYYYMMDD (for
> human readability).  Option "-y 0" can be used to set it to zero, in
> which case timestamping is disabled.  The maintenance functions can be
> used to delete tokens by date, count, and/or length.

Ahh. I should have read the code more carefully. :) So it's already there.
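
For illustration only, here is a rough sketch of "delete old hapaxes"
in Python. The in-memory layout (token -> spam count, good count,
YYYYMMDD stamp) and the function name are made up for the example and
are not bogofilter's wordlist format or maintenance code. Treating a
stamp of 0 as "no timestamp, so keep it" is also an assumption.

```python
def prune_old_hapaxes(wordlist, cutoff):
    """Return a copy of wordlist without hapaxes last seen before cutoff.

    wordlist maps token -> (spam_count, good_count, yyyymmdd).
    This layout is hypothetical, not bogofilter's on-disk format.
    """
    kept = {}
    for token, (spam, good, stamp) in wordlist.items():
        is_hapax = spam + good == 1
        # YYYYMMDD integers sort chronologically, so a plain
        # comparison works. A stamp of 0 is treated as "timestamping
        # disabled" and the token is kept (an assumption).
        if is_hapax and 0 < stamp < cutoff:
            continue
        kept[token] = (spam, good, stamp)
    return kept


wordlist = {
    "viagra": (12, 0, 20030901),  # not a hapax, kept
    "xqzzy": (1, 0, 20030601),    # old hapax, dropped
    "relson": (0, 1, 20030920),   # recent hapax, kept
    "legacy": (1, 0, 0),          # no timestamp, kept
}
pruned = prune_old_hapaxes(wordlist, cutoff=20030801)
print(sorted(pruned))
```

A hapax seen once in spam long ago is unlikely to matter for scoring,
which is why the count and the date are checked together here.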
 
> My preference would be to keep those tokens out of the wordlist from the
> get-go.  However, I don't as yet have a way to do that.  The answer may
> well be as you suggest - delete old hapaxes.
> 
> David
> 
> P.S. Have you looked at the changed parsing in 0.15.4???  It should look
> familiar to you :-)

Looks familiar indeed. :)

I note that you're still using 'head:' to mark header tokens. I worry
about the size impact this has on the database. When using db or
similar, that's 5 bytes added to every header token...
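
A quick back-of-the-envelope check of that overhead, as a Python
sketch. It assumes the store keeps every key in full, with no prefix
compression. The token list is invented for the example.

```python
PREFIX = "head:"

# Invented sample of header tokens, for illustration only.
tokens = ["from", "to", "subject", "received"]
keys = [PREFIX + t for t in tokens]

# Key bytes without and with the marker prefix.
plain_bytes = sum(len(t.encode("utf-8")) for t in tokens)
keyed_bytes = sum(len(k.encode("utf-8")) for k in keys)

# Each key grows by exactly len("head:") bytes.
overhead = keyed_bytes - plain_bytes
print(overhead, overhead // len(tokens))
```

The cost is a fixed 5 bytes per header token, so it is proportionally
worst for short tokens such as "to".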

(If we were using a patricia trie then it wouldn't be a problem at
all because the shared prefix would be stored only once).
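
To illustrate the trie point: in a trie, keys that start with the same
string share the nodes for that string. The sketch below uses a plain
character trie built from nested dicts, not a real patricia trie, and
the token list is invented. A patricia trie would additionally
collapse single-child chains, so the "head:" run would take even less
space.

```python
def count_trie_nodes(words):
    """Build a character trie from words and return its node count.

    Plain (uncompressed) trie, used only to count nodes. End-of-word
    markers are omitted because they do not affect the count.
    """
    root = {}
    nodes = 0
    for word in words:
        node = root
        for ch in word:
            if ch not in node:
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


# Invented sample of header tokens, for illustration only.
tokens = ["from", "to", "subject"]
plain = count_trie_nodes(tokens)
prefixed = count_trie_nodes(["head:" + t for t in tokens])

# The prefix costs 5 extra nodes in total, not 5 per token.
print(plain, prefixed, prefixed - plain)
```

However many header tokens are added, the extra cost of the marker
stays at the five nodes for "head:".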

Michael.



