Data base "maintenance" (removing tokens) and MSG_COUNT?

Matthias Andree matthias.andree at gmx.de
Tue Jul 15 23:17:11 CEST 2003


Greg Louis <glouis at dynamicro.on.ca> writes:

> Removing tokens without correcting .MSG_COUNT does not invalidate the
> counts of existing tokens, but newly added tokens will have lower
> token-count-to-message-count ratios than is warranted,

That's my concern, in better words than I could have put it.

> and will thus have less influence than they should in comparison to
> the ones entered before the pruning.
>
> It would take an awfully long time (especially if one trains on errors
> and unsures) for this effect to become really significant.  I'd still
> advocate not using the maintenance mode of bogoutil to prune the
> database, however, unless someone came up with a way to adjust
> .MSG_COUNT appropriately.  That doesn't look easy to do.

No, and I expect it involves changing the data base contents (the
format, i.e. what we store). Token and count alone may not be
sufficient; we may also need to store the ratio rather than an absolute
token count. For scalability, however, we want to touch only the tokens
we register (not those that aren't present in the mail), and we might
need to find a way to update the ratio incrementally, if there is one.
I know too little about how the algorithm works to suggest anything
offhand.
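
To make the two points above concrete, here is a toy sketch in Python.
It is not bogofilter's code: the helper name ratio(), the token names
and all the numbers are made up for illustration, and the "ratio" is
simply token count divided by message count. The first part shows the
effect Greg describes for a token that is pruned and later re-added;
the second part shows why storing ratios directly would conflict with
touching only the tokens present in a mail.

```python
# Toy model only, NOT bogofilter's implementation.
# All names and numbers below are illustrative assumptions.

def ratio(token_count, msg_count):
    """Token-count-to-message-count ratio for a single token."""
    return token_count / msg_count

# Part 1: pruning a token without adjusting the message count.
msg_count = 1000      # messages registered so far (role of .MSG_COUNT)
token_count = 3       # the token was seen in 3 of them

token_count = 0       # maintenance removes the token; msg_count is untouched

msg_count += 100      # 100 more messages are registered ...
token_count += 2      # ... and the token occurs in 2 of them

after_pruning = ratio(token_count, msg_count)   # 2 / 1100
without_pruning = ratio(3 + 2, msg_count)       # 5 / 1100

print(f"after pruning:   {after_pruning:.5f}")
print(f"without pruning: {without_pruning:.5f}")

# Part 2: if ratios were stored instead of counts, registering one more
# message would change the correct value for every token, including
# tokens that do not occur in that message.
counts = {"offer": 30, "meeting": 1}
before = {tok: ratio(cnt, 1000) for tok, cnt in counts.items()}
after = {tok: ratio(cnt, 1001) for tok, cnt in counts.items()}
stale = [tok for tok in counts if before[tok] != after[tok]]

print("tokens whose stored ratio would be stale:", stale)
```

With absolute counts plus a single message counter, registering a mail
only touches the tokens in that mail and the counter; with stored
ratios, every entry would need rewriting unless some incremental scheme
exists.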

-- 
Matthias Andree



