mailing lists and hapaxes
michael at optusnet.com.au
Thu Sep 25 03:53:09 CEST 2003
David Relson <relson at osagesoftware.com> writes:
> On 25 Sep 2003 10:20:17 +1000
> michael at optusnet.com.au wrote:
[..]
> > Given that the majority of tokens in the database are normally hapaxes
> > (where _does_ that term come from? :) , the overhead from adding a
> > timestamp may be outweighed by the gain from shrinking the database.
>
> Hi Michael,
>
> Your reasoning sounds good.
>
> By the way, bogofilter already has a timestamp of form YYYYMMDD (for
> human readability). Option "-y 0" can be used to set it to zero, in
> which case timestamping is disabled. The maintenance functions can be
> used to delete tokens by date, count, and/or length.
Ahh. I should have read the code more carefully. :) So it's already there.
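
A minimal sketch of deleting old hapaxes by date, in Python. It assumes a
hypothetical in-memory wordlist mapping each token to
(spam_count, good_count, date), with date an integer of the form YYYYMMDD
and 0 meaning "no timestamp". The function name and data layout are
illustrative only; this is not bogofilter's actual storage format or
maintenance code.

```python
def prune_old_hapaxes(wordlist, cutoff_yyyymmdd):
    """Return a copy of wordlist without hapaxes last seen before the cutoff.

    wordlist: dict of token -> (spam_count, good_count, date)   (assumed layout)
    cutoff_yyyymmdd: integer date such as 20030601
    """
    kept = {}
    for token, (spam, good, date) in wordlist.items():
        is_hapax = (spam + good) == 1
        # Assumption: a date of 0 means timestamping was disabled,
        # so such tokens are never treated as "old".
        is_old = date != 0 and date < cutoff_yyyymmdd
        if is_hapax and is_old:
            continue  # drop the old hapax
        kept[token] = (spam, good, date)
    return kept


# Hypothetical sample data, for illustration only.
sample = {
    "viagra": (5, 0, 20030901),
    "xqzt": (1, 0, 20030101),   # old hapax, removed by the call below
    "hello": (0, 1, 20030920),  # recent hapax, kept
}
pruned = prune_old_hapaxes(sample, 20030601)
```

Integer dates of the form YYYYMMDD compare in chronological order, so a
plain "<" is enough to test whether a token is older than the cutoff.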
> My preference would be to keep those tokens out of the wordlist from the
> get-go. However, I don't as yet have a way to do that. The answer may
> well be as you suggest - delete old hapaxes.
>
> David
>
> P.S. Have you looked at the changed parsing in 0.15.4??? It should look
> familiar to you :-)
Looks familiar indeed. :)
I note that you're still using 'head:' to mark header tokens. I worry
about the size impact this has on the database. When using db or
similar, that's 5 bytes added to every header token...
(If we were using a patricia trie then it wouldn't be a problem at
all because the shared prefix would be stored only once).
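
A rough sketch of the size argument, in Python. It compares the key bytes
a flat key/value store would hold for 'head:'-prefixed tokens with the
number of characters stored in a simple character trie. The trie here is
a plain, uncompressed one rather than a real patricia trie, per-node
pointer overhead is ignored, and the function names and sample tokens are
made up for illustration.

```python
PREFIX = "head:"  # 5 bytes in ASCII/UTF-8


def flat_key_bytes(tokens):
    """Total key bytes when every token is stored as its own full key."""
    return sum(len(t.encode("utf-8")) for t in tokens)


def trie_char_count(tokens):
    """Number of characters stored in a plain character trie of the tokens.

    Characters on a shared prefix are stored once, however many tokens
    pass through them.
    """
    root = {}
    count = 0
    for t in tokens:
        node = root
        for ch in t:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


# Hypothetical header tokens, for illustration only.
header_tokens = ["subject", "from", "received"]
prefixed = [PREFIX + t for t in header_tokens]

# Flat store: the prefix costs 5 bytes per header token.
flat_overhead = flat_key_bytes(prefixed) - flat_key_bytes(header_tokens)
# Trie: the prefix costs 5 characters in total.
trie_overhead = trie_char_count(prefixed) - trie_char_count(header_tokens)
```

With these three tokens the flat overhead is 15 bytes while the trie
overhead is 5 characters; the flat cost grows with the number of header
tokens and the trie cost does not.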
Michael.