db maintenance "delete oldest least used tokens, but maintain count of x"
David Relson
relson at osagesoftware.com
Wed Mar 17 14:02:40 CET 2004
On Wed, 17 Mar 2004 13:23:02 +0100
Boris 'pi' Piwinger wrote:
> David Relson wrote:
>
> >> Having looked (but not posted), it appears as though the MSG_COUNT
> >> was used to evaluate the individual spamicity of the token only, and
> >> hence dropping tokens (without changing their associated spam/ham
> >> counts) should be safe. This all providing that I haven't missed a
> >> reference to the MSG_COUNT.
> >
> > That's correct. AFAICT removing unwanted tokens from the wordlist is
> > OK. It has the obvious effects - smaller wordlist and tokens
> > becoming unknown. It doesn't affect the remaining tokens. Of
> > course if someone deletes the wrong tokens, the effects will be
> > serious.
>
> The danger seems to be that once the tokens return, their
> values will be horribly wrong.

Rather than "horribly wrong", how about "severely biased"? Remember
that the effect depends on how the wordlist is built/maintained.
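
To put a rough number on "severely biased", here is a minimal sketch
of a Robinson-style smoothed token score. The function name, the
counts, and the parameter values s=1.0 and x=0.5 are illustrative
assumptions, not bogofilter's actual code or defaults. The point is
that a token's score depends only on its own counts and the message
totals, so a token that returns after deletion is scored from very
little evidence.

```python
def spamicity(spam_count, ham_count, spam_msgs, ham_msgs, s=1.0, x=0.5):
    """Robinson-style smoothed estimate f(w) = (s*x + n*p) / (s + n).

    s (strength of the prior) and x (score assumed for an unknown
    token) are illustrative values, not bogofilter defaults.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # unknown token: fall back to the prior
    spam_rate = spam_count / spam_msgs
    ham_rate = ham_count / ham_msgs
    p = spam_rate / (spam_rate + ham_rate)
    return (s * x + n * p) / (s + n)


# Hypothetical token seen equally often in spam and ham before trimming.
before = spamicity(50, 50, spam_msgs=1000, ham_msgs=1000)

# The same token after deletion, when it returns in a single spam
# message: its history is gone, so one sighting dominates the score.
after = spamicity(1, 0, spam_msgs=1000, ham_msgs=1000)

print(f"before trimming: {before:.2f}, after returning: {after:.2f}")
```

With these made-up numbers a neutral token (0.50) comes back looking
fairly spammy (0.75) until enough new sightings accumulate. Tokens
that were never deleted keep exactly the same scores.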

Using auto-update, I train on everything. So I have many tokens and
lots of hapaxes (tokens only seen once). The old hapaxes are tokens
that have not reappeared (else they'd have counts greater than 1). If I
decided to trim my wordlist, I'd probably choose that category, i.e. old
hapaxes, with the expectation of minimal side effects.

Using train-on-error, the wordlists are much smaller and token counts
are much lower. Removing any tokens is more likely to affect scoring (I
think).

It's a good thing bogofilter has so many ways to be used. It leads to
interesting discussions.