db maintenance "delete oldest least used tokens, but maintain count of x"

Wed Mar 17 14:02:40 CET 2004

On Wed, 17 Mar 2004 13:23:02 +0100
Boris 'pi' Piwinger wrote:

> David Relson wrote:
> 
> >> Having looked (but not posted), it appears as though the MSG_COUNT
> >was> used to evaluate the individual spamicity of the token only, and
> >hence> dropping tokens (without changing their associated spam/ham
> >counts)> should be safe. This all providing that I haven't missed a
> >reference> to the MSG_COUNT.
> > 
> > That's correct. AFAICT removing unwanted tokens from the wordlist is
> > OK. It has the obvious effects - smaller wordlist and tokens
> > becoming unknown.  It doesn't affect the remaining tokens.  Of
> > course if someone delete the wrong tokens, the effects will be
> > serious.
> 
> The danger seems to be that once the tokens return, there
> values will be horribly wrong.

Rather than "horribly wrong", how about "severely biased".  Remember
that the effect depends on how the wordlist is built/maintained.  

Using auto-update, I train on everything.  So I have many tokens and
lots of hapaxes (tokens only seen once).  The old hapaxes are tokens
that have not reappeared (else they'd have counts greater than 1).  If I
decided to trim my wordlist, I'd probably choose that category, i.e. old
hapaxes, with the expectation of minimal side effects.

Using train-on-error, the wordlists are much smaller and token counts
are much lower.  Removing any tokens is more likely to affect scoring (I
think).

It's a good thing bogofilter has so many ways to be used.  It leads to
interesting discussions.