db maintenance "delete oldest least used tokens, but maintain count of x"

Wed Mar 17 14:09:24 CET 2004

David Relson wrote:

>> The danger seems to be that once the tokens return, their
>> values will be horribly wrong.
> 
> Rather than "horribly wrong", how about "severely biased".  Remember
> that the effect depends on how the wordlist is built/maintained.  

Certainly it does. But imagine one of those tokens was
hammish and now suddenly becomes spammish (or vice versa).
Normally, with its old counts kept, it would just become
insignificant. With the old counts deleted, it is suddenly
one-sided.
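
To put rough numbers on that, here is a small sketch of a
Robinson-style token score. The function name, the message totals and
the parameter values s=1.0 and x=0.5 are assumptions picked for
illustration, not bogofilter's actual code or defaults.

```python
def token_score(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson-style f(w): blend the observed spamminess p(w) with a prior x.

    s (strength of the prior) and x (score for an unknown token) are
    illustrative values, not bogofilter's configured defaults.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # token never seen: fall back to the prior
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    p = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * p) / (s + n)


# Hypothetical token: seen in 20 ham messages, later in 20 spam messages,
# with 1000 ham and 1000 spam messages trained in total.
kept = token_score(spam_count=20, ham_count=20, n_spam=1000, n_ham=1000)

# Same token, but its ham history was deleted before it came back as spam.
purged = token_score(spam_count=20, ham_count=0, n_spam=1000, n_ham=1000)

print(f"history kept:    {kept:.3f}")
print(f"history deleted: {purged:.3f}")
```

With the history kept, the token scores 0.5 and drops out as
insignificant. With the history deleted, it scores close to 1 and
counts as a strong spam indicator.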

> Using auto-update, I train on everything.  So I have many tokens and
> lots of hapaxes (tokens only seen once).  The old hapaxes are tokens
> that have not reappeared (else they'd have counts greater than 1).  If I
> decided to trim my wordlist, I'd probably choose that category, i.e. old
> hapaxes, with the expectation of minimal side effects.

With huge wordlists it won't matter, right.
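
For what it's worth, the "old hapaxes" rule quoted above could be
sketched like this. The in-memory dict, the (spam, ham, last_seen)
layout and the cutoff date are all made up for the example. This is
not how bogofilter stores its wordlist, just the selection rule.

```python
from datetime import date


def prune_old_hapaxes(wordlist, cutoff):
    """Return a copy of wordlist without tokens that were seen exactly
    once in total and not since before the cutoff date.

    wordlist maps token -> (spam_count, ham_count, last_seen); this
    layout is a hypothetical stand-in for the real database.
    """
    return {
        token: (spam, ham, last_seen)
        for token, (spam, ham, last_seen) in wordlist.items()
        if not (spam + ham == 1 and last_seen < cutoff)
    }


wordlist = {
    "mortgage": (57, 2, date(2004, 3, 10)),    # frequent token, kept
    "xqzt": (1, 0, date(2003, 6, 1)),          # old hapax, removed
    "bogofilter": (0, 1, date(2004, 3, 15)),   # recent hapax, kept
}

trimmed = prune_old_hapaxes(wordlist, cutoff=date(2004, 1, 1))
print(sorted(trimmed))
```

Only tokens with a total count of 1 and an old timestamp go away, so
anything that has contributed more than one observation stays
untouched.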

> Using train-on-error, the wordlists are much smaller and token counts
> are much lower.  Removing any tokens is more likely to affect scoring (I
> think).

After all, there is no reason to remove tokens. Once in a
while one might want to rebuild the list (like I will with
0.17.3 due to the lexer change).

> It's a good thing bogofilter has so many ways to be used.  It leads to
> interesting discussions.

And lots of confusion for those who don't really understand
the key concepts of the calculations;->

pi