Data base "maintenance" (removing tokens) and MSG_COUNT?
David Relson
relson at osagesoftware.com
Tue Jul 15 19:38:08 CEST 2003
At 12:10 PM 7/15/03, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> > Your example shows .MSG_COUNT being deleted because it has a low count.
> > As it's a special token, it should _never_ be deleted. I'll fix the
> > flaw in bogoutil and the corrected code will be in the next release.
>
>I wonder if that's the whole story. If we skew the individual token's
>spamicities (of the remaining tokens), nothing is gained by the "fix".
The whole story? Not likely :-)

Bogofilter _needs_ a value for .MSG_COUNT in order to function, so deleting
.MSG_COUNT from the wordlist(s) will cause problems. Using maintenance
mode to delete other tokens will also have effects: smaller wordlists,
faster scoring, more unrecognized tokens, etc. These effects will change the
scores for some messages. That's the way of the world.
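
To make the intended behaviour concrete, here is a minimal sketch of a pruning pass that never touches special tokens. It is not bogoutil's code. The in-memory wordlist (a dict mapping token to a (spam, good) count pair), the name prune_low_counts, and the rule that special tokens are recognized by a leading dot are all assumptions made for illustration.

```python
# Hypothetical in-memory model of a wordlist: token -> (spam_count, good_count).
# This is an illustration only, not bogoutil's code or on-disk format.
SPECIAL_PREFIX = "."  # assumption: special tokens such as .MSG_COUNT start with a dot


def prune_low_counts(wordlist, min_count):
    """Return a copy of wordlist without ordinary tokens whose total
    count is below min_count. Special tokens are always kept."""
    kept = {}
    for token, (spam, good) in wordlist.items():
        if token.startswith(SPECIAL_PREFIX) or spam + good >= min_count:
            kept[token] = (spam, good)
    return kept


wordlist = {
    ".MSG_COUNT": (1, 0),  # low count, but must survive maintenance
    "viagra": (5, 0),
    "meeting": (0, 7),
    "xyzzy": (1, 0),       # ordinary token with a low count
}
pruned = prune_low_counts(wordlist, min_count=2)
```

In this sketch "xyzzy" is dropped while .MSG_COUNT stays, even though both have a total count of 1.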
From the point of view of wordlist integrity, I don't think it's harmful
to delete old tokens or tokens with particular counts. For example,
deleting tokens with counts of 1 is comparable to training with a slightly
different set of messages, i.e. messages from which those tokens
(the ones with counts of 1) have been removed. This slightly reduces
bogofilter's ability to score (because there are fewer recognized
tokens), but I don't think the effect is significant.
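
The comparison with "training on slightly different messages" can be checked on a toy model. The sketch below assumes that a token's count is the number of training messages containing it, uses a single made-up corpus, and splits on whitespace. The train function and the sample messages are hypothetical, and real bogofilter tokenization and wordlists are more involved.

```python
from collections import Counter


def train(messages):
    """Count, for each token, how many messages contain it (toy model)."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.split()))  # each token counted once per message
    counts[".MSG_COUNT"] = len(messages)
    return counts


messages = [
    "cheap pills online",
    "cheap watches online",
    "cheap pills today",
]

trained = train(messages)

# Maintenance: drop ordinary tokens with a count of 1, keep special tokens.
pruned = {t: c for t, c in trained.items() if t.startswith(".") or c > 1}

# Alternative: strip those same tokens from the messages, then train again.
rare = {t for t, c in trained.items() if c == 1 and not t.startswith(".")}
stripped = [" ".join(w for w in msg.split() if w not in rare) for msg in messages]
retrained = dict(train(stripped))
```

In this toy model the pruned wordlist and the retrained one come out identical, and .MSG_COUNT is the same in both because no message was removed.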