Data base "maintenance" (removing tokens) and MSG_COUNT?

David Relson relson at osagesoftware.com
Tue Jul 15 19:38:08 CEST 2003


At 12:10 PM 7/15/03, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> > Your example shows .MSG_COUNT being deleted because it has a low count.
> > As it's a special token, it should _never_ be deleted.  I'll fix the
> > flaw in bogoutil and the corrected code will be in the next release.
>
>I wonder if that's the whole story. If we skew the individual token's
>spamicities (of the remaining tokens), nothing is gained by the "fix".

The whole story?  Not likely :-)

Bogofilter _needs_ a value for .MSG_COUNT in order to function.  Deleting 
.MSG_COUNT from the wordlist(s) will cause problems.  Using maintenance 
mode to delete other tokens will have side effects: smaller wordlists, 
faster scoring, more unrecognized tokens, etc.  These effects will change 
the scores for some messages.  That's just the way it works.
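
To illustrate why the message count matters, here is a minimal sketch of a 
simplified per-token spamicity estimate that normalizes token counts by the 
number of messages trained.  It is not bogofilter's actual code; the function 
name, its parameters, and the exact formula are assumptions made for the 
example.

```python
def token_spamicity(spam_count, ham_count, spam_msgs, ham_msgs):
    """Simplified, hypothetical spamicity estimate for one token.

    spam_msgs and ham_msgs stand in for the values stored under
    .MSG_COUNT in the spam and non-spam wordlists (an assumption).
    """
    if spam_msgs <= 0 or ham_msgs <= 0:
        # Without message counts the frequencies cannot be normalized.
        raise ValueError("message counts (.MSG_COUNT) are required")

    spam_freq = spam_count / spam_msgs   # fraction of spam containing token
    ham_freq = ham_count / ham_msgs      # fraction of non-spam containing token

    if spam_freq + ham_freq == 0:
        return 0.5                       # unknown token: neutral score

    return spam_freq / (spam_freq + ham_freq)
```

With the message counts missing (zero here), the estimate cannot be computed 
at all, which is the kind of problem that deleting .MSG_COUNT would cause.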

From the point of view of wordlist integrity, I don't think it's harmful 
to delete old tokens or tokens with particular counts.  For example, 
deleting tokens with counts of 1 is comparable to training with a slightly 
different set of messages, i.e. using messages from which those tokens 
(with the counts of 1) have been deleted.  This has a slight effect on 
bogofilter's ability to score (because there are fewer recognized 
tokens).  I don't think this effect is significant.
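
The kind of count-based pruning described above, with the special token 
protected, might be sketched as follows.  This is only an illustration, not 
bogoutil's implementation: the dict layout, the `prune_wordlist` name, the 
`max_count` parameter, and the rule that special tokens start with "." are 
all assumptions.

```python
SPECIAL_PREFIX = "."  # assumption: special tokens such as .MSG_COUNT start with "."


def prune_wordlist(wordlist, max_count=1):
    """Return a copy of wordlist without low-count ordinary tokens.

    wordlist maps token -> (spam_count, ham_count).  A token is dropped
    when its total count is <= max_count, unless it is a special token.
    """
    kept = {}
    for token, (spam, ham) in wordlist.items():
        is_special = token.startswith(SPECIAL_PREFIX)
        if is_special or spam + ham > max_count:
            kept[token] = (spam, ham)
    return kept


# Hypothetical wordlist: .MSG_COUNT survives even though its count is 1.
example = {
    ".MSG_COUNT": (1, 0),
    "viagra": (5, 0),
    "rare": (1, 0),
    "hello": (0, 3),
}
pruned = prune_wordlist(example, max_count=1)
```

Here only "rare" is removed; .MSG_COUNT is kept regardless of its count.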





