Data base "maintenance" (removing tokens) and MSG_COUNT?

Greg Louis glouis at dynamicro.on.ca
Tue Jul 15 18:04:03 CEST 2003


On 20030715 (Tue) at 1117:15 -0400, David Relson wrote:
> At 11:09 AM 7/15/03, Matthias Andree wrote:
> >Hi,
> >
> >what I am currently wondering about is:
> >
> >we register the token count and a message count, to obtain a certain
> >"spamicity" of an individual token.
> >
> >However, what happens if tokens are removed? We don't adjust MSG_COUNT
> >AFAIR, so I fear that in the long run, all individual token spamicity
> >values will be too low because the MSG_COUNT is too high, and the ROBX
> >might also be bogus.
> >
> >Does this need to be taken into account? If so, do we need to store more
> >information or adjust the .MSG_COUNT?
> 
> Matthias,
> 
> .MSG_COUNT is decremented when messages are removed via -N and -S.
> 
True; but what about maintenance mode?  If you remove tokens older than
x, or with counts less than y, .MSG_COUNT is no longer an accurate
reflection of the number of messages contributing to the training
database.

ROBX is a guess at a prior, and its value can be (and should be, and is
when bogotune is run) adjusted.  Its starting value should be derived
from the tokens actually in the database, so it's ok that it changes
when tokens are removed.  (Of course, if it's not recalculated, the old
value will be wrong, which may or may not have an impact.)
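
To make the ROBX point concrete, here is a minimal sketch of deriving a starting value from the tokens currently in the database, and of how that value shifts once tokens are removed. It is not bogofilter's or bogotune's actual code: the function name estimate_robx, the min_count cutoff, the 0.5 fallback, and the sample counts are all assumptions made for illustration, and the per-token probability is the usual ratio of spam frequency to the sum of spam and ham frequencies.

```python
def estimate_robx(tokens, spam_msgs, ham_msgs, min_count=10):
    """Average per-token spam probability over tokens seen often enough.

    tokens maps a token to a (spam_count, ham_count) pair.
    min_count and the 0.5 fallback are illustrative assumptions.
    """
    probs = []
    for spam_n, ham_n in tokens.values():
        total = spam_n + ham_n
        if total == 0 or total < min_count:
            continue  # too little evidence to contribute to the prior
        spam_freq = spam_n / spam_msgs
        ham_freq = ham_n / ham_msgs
        probs.append(spam_freq / (spam_freq + ham_freq))
    return sum(probs) / len(probs) if probs else 0.5


# Hypothetical database contents.
db = {"viagra": (30, 10), "meeting": (10, 10)}
robx_before = estimate_robx(db, spam_msgs=100, ham_msgs=100)

# Pruning a token changes the set being averaged, hence the estimate.
del db["viagra"]
robx_after = estimate_robx(db, spam_msgs=100, ham_msgs=100)
```

With these made-up numbers the estimate moves when "viagra" is pruned, which is the expected behaviour: the prior should follow whatever tokens are actually present.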

Removing tokens without correcting .MSG_COUNT does not invalidate the
counts of existing tokens, but newly added tokens will have lower
token-count-to-message-count ratios than are warranted, and will thus
have less influence than they should in comparison to the tokens
entered before the pruning.
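
A small arithmetic sketch of that effect follows. Every number here is hypothetical, and msgs_still_represented in particular stands for an "effective" message count after pruning that bogofilter does not store and that nobody has shown how to compute; it is assumed only to make the comparison visible.

```python
def token_ratio(token_count, msg_count):
    """Token count divided by the message count it is measured against."""
    if msg_count <= 0:
        raise ValueError("msg_count must be positive")
    return token_count / msg_count


# Hypothetical figures, chosen only for illustration.
msgs_before_pruning = 10_000     # .MSG_COUNT as stored, never adjusted
msgs_still_represented = 6_000   # assumed effective count after pruning
msgs_trained_after = 500         # messages registered since the pruning
new_token_count = 50             # a token first seen after the pruning

# Ratio actually obtained, using the uncorrected .MSG_COUNT.
stale = token_ratio(new_token_count,
                    msgs_before_pruning + msgs_trained_after)

# Ratio that would result if .MSG_COUNT had been corrected.
adjusted = token_ratio(new_token_count,
                       msgs_still_represented + msgs_trained_after)

print(f"stale={stale:.5f} adjusted={adjusted:.5f}")
```

The uncorrected ratio comes out smaller than the adjusted one, so the new token carries less weight than it would against a corrected count. The gap narrows as more messages are trained after the pruning.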

It would take an awfully long time (especially if one trains on errors
and unsures) for this effect to become really significant.  I'd still
advocate not using the maintenance mode of bogoutil to prune the
database, however, unless someone came up with a way to adjust
.MSG_COUNT appropriately.  That doesn't look easy to do.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |