Data base "maintenance" (removing tokens) and MSG_COUNT?

Greg Louis glouis at dynamicro.on.ca
Tue Jul 15 23:52:17 CEST 2003


On 20030715 (Tue) at 2317:11 +0200, Matthias Andree wrote:

> > It would take an awfully long time (especially if one trains on errors
> > and unsures) for this effect to become really significant.  I'd still
> > advocate not using the maintenance mode of bogoutil to prune the
> > database, however, unless someone came up with a way to adjust
> > .MSG_COUNT appropriately.  That doesn't look easy to do.
> 
> No, and I expect it involves changing the data base contents (format,
> what we store), just token and count may not be sufficient, we may also
> need to store the ratio rather than an absolute token count. For
> scalability however, we want to touch only the tokens we register (not
> those that aren't present in the mail), and we might need to find a way
> to incrementally update the ratio, if there is one. I know too little
> about the algorithm's function to suggest anything offhand.

Storing ratios would require that we update them all whenever
.MSG_COUNT changed, a tall order.  Just updating the ones in the new
message doesn't do the trick, since the denominator changes for every
token in the db.
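
To make the problem concrete, here is a minimal sketch.  It assumes
"ratio" means a token's count divided by the message count; the token
names and numbers are invented for illustration and none of this is
bogofilter's actual storage format.

```python
def ratios(counts, msg_count):
    """Ratio of each token's count to the total message count."""
    return {tok: c / msg_count for tok, c in counts.items()}

# Hypothetical training state: two tokens, 100 messages registered.
counts = {"free": 30, "meeting": 5}
msg_count = 100
stored = ratios(counts, msg_count)

# Register one new message that contains only "free", and touch only
# the ratio of the token we actually saw.
counts["free"] += 1
msg_count += 1
stored["free"] = counts["free"] / msg_count

# Recompute everything from scratch and compare with what is stored.
fresh = ratios(counts, msg_count)
stale = [tok for tok in counts if stored[tok] != fresh[tok]]
print(stale)
```

The ratio for "meeting" is still 5/100 in the stored table although it
should now be 5/101, so every token absent from the new message goes
stale as soon as the message count moves.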

I've been looking recently at an alternative to Graham's p(w)
calculation proposed by Joe Marshall, who points out that in a real
Bayesian analysis the absence of a token from a message carries
information just as its presence does.  At first I thought this meant
traversing the whole db for every message, but Joe had another good
idea: each time you update the training db, recalculate the score that
a message with no tokens (every token in the database absent) would
receive.  Then, when classifying, start with the no-token score and,
for each token in the message, divide by its "absent" index and
multiply by its "present" index.  The resulting product is the score
of the message, with both present and absent tokens contributing.
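
A rough sketch of that bookkeeping follows.  The definitions of the
two indices are my assumption, not necessarily Joe's: "present" is
taken as P(w|spam)/P(w|ham) and "absent" as
(1 - P(w|spam))/(1 - P(w|ham)), with add-one smoothing and prior odds
of n_spam/n_ham.  The function names and counts are made up for the
example.

```python
import math

def indices(spam_count, ham_count, n_spam, n_ham):
    """Return (present, absent) likelihood ratios for one token."""
    # Add-one smoothing keeps both probabilities strictly inside (0, 1).
    p_spam = (spam_count + 1) / (n_spam + 2)
    p_ham = (ham_count + 1) / (n_ham + 2)
    return p_spam / p_ham, (1 - p_spam) / (1 - p_ham)

def build_table(counts, n_spam, n_ham):
    """Done once per db update: per-token indices plus the no-token score."""
    table = {tok: indices(s, h, n_spam, n_ham) for tok, (s, h) in counts.items()}
    # Score of a message in which every known token is absent.
    no_token = (n_spam / n_ham) * math.prod(absent for _, absent in table.values())
    return table, no_token

def score(tokens, table, no_token):
    """Classify by touching only the tokens present in the message."""
    odds = no_token
    for tok in set(tokens):
        if tok in table:  # tokens unknown to the db contribute nothing
            present, absent = table[tok]
            odds = odds / absent * present
    return odds

def score_full_traversal(tokens, table, n_spam, n_ham):
    """Reference version: walks the whole db for every message."""
    seen = set(tokens)
    odds = n_spam / n_ham
    for tok, (present, absent) in table.items():
        odds *= present if tok in seen else absent
    return odds

# Hypothetical counts: token -> (spam messages, ham messages).
counts = {"viagra": (40, 1), "meeting": (2, 30), "the": (90, 95)}
table, no_token = build_table(counts, 100, 100)
msg = ["viagra", "the", "unknown"]
print(score(msg, table, no_token), score_full_traversal(msg, table, 100, 100))
```

Both functions give the same odds (up to rounding), but score() does
work proportional to the message length rather than the db size.  With
a db of realistic size the product would underflow, so one would
presumably keep sums of logarithms instead.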

This, while theoretically superior, may not be of sufficient practical
benefit to be worth implementing.  We're investigating that point now.
Should it turn out to be worthwhile, storing ratios would be unhelpful;
if anything, we'd want to store the "absent" and "present" indices
(they too would have to be updated with every db change, though).

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



