Training from scratch.

Tom Anderson tanderso at oac-design.com
Thu Jul 15 16:34:59 CEST 2004


From: "Tom Eastman" <tom-lists at celleste.no-ip.org>
> I had intuited that a low min_dev would mean that there were more
> neutral-ish tokens that would push the score towards 0.5.

To achieve that (neutralizing tokens with low counts), you want to increase
your robs value.  That will increase the "inertia" from robx.  And robx
should generally be on the slightly hammy side of 0.5 to bias against false
positives (it takes more "escape velocity" to become spammy than hammy).
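
To make the "inertia" concrete, here is a minimal sketch of Robinson-style
smoothing, f = (robs*robx + n*p) / (robs + n).  The function name
token_score is made up for illustration, and the raw spam fraction p is
simplified by assuming equally sized ham and spam corpora; it is not
bogofilter's actual code.

```python
def token_score(spam_count, ham_count, robs, robx):
    """Smoothed spamicity f = (robs*robx + n*p) / (robs + n).

    Hypothetical helper, not bogofilter's implementation.
    """
    n = spam_count + ham_count
    if n == 0:
        return robx  # unseen token: only the prior robx is left
    # Raw spam fraction; assumes equally sized ham and spam corpora.
    p = spam_count / n
    return (robs * robx + n * p) / (robs + n)


# A token registered once, in a spam, under increasing robs.
for robs in (0.01, 1.0, 10.0):
    print(robs, round(token_score(1, 0, robs, robx=0.45), 3))
```

The larger robs is, the closer a single registration leaves the token to
robx, which is the neutralizing effect described above.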

A small min_dev would mean that tokens would affect scoring after their
first registration.  With a higher min_dev (and significant enough robs), it
should take a few registrations before a token will affect scoring.  This is
good because you can be relatively sure it's being classified correctly once
you have seen it several times and it has consistently pushed itself in the
same direction.  Otherwise it could flip-flop close to robx, sometimes
contributing to ham, sometimes to spam.  In all cases, your min_dev range
(0.5 +/- min_dev) should at least encompass your robx value so that tokens do
not affect classifications the very first time they are seen.
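
As a rough illustration of how min_dev and robs together delay a token's
influence, the sketch below counts how many spam-only registrations a token
needs before it leaves the 0.5 +/- min_dev band.  The helper names are
hypothetical, the smoothing formula is the simplified one assumed above, and
treating a deviation exactly equal to min_dev as "scoring" is also an
assumption.

```python
def token_score(spam_count, ham_count, robs, robx):
    """Simplified Robinson smoothing (assumes equal-sized corpora)."""
    n = spam_count + ham_count
    if n == 0:
        return robx
    p = spam_count / n
    return (robs * robx + n * p) / (robs + n)


def counts_toward_score(score, min_dev):
    """A token is used only if it deviates from 0.5 by at least min_dev."""
    return abs(score - 0.5) >= min_dev


def registrations_until_scoring(robs, robx, min_dev, limit=1000):
    """Spam-only registrations needed before the token starts to score.

    Returns 0 if even an unseen token (sitting at robx) already scores,
    which is the situation the min_dev range is meant to prevent.
    """
    for n in range(limit + 1):
        if counts_toward_score(token_score(n, 0, robs, robx), min_dev):
            return n
    return None


print(registrations_until_scoring(robs=1.0, robx=0.45, min_dev=0.1))
print(registrations_until_scoring(robs=1.0, robx=0.45, min_dev=0.35))
print(registrations_until_scoring(robs=1.0, robx=0.2, min_dev=0.1))
```

The last call uses a robx that lies outside the min_dev range, so tokens
would count before they have ever been registered.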

Consider emails containing "this is a spam" and "this is a ham".  The tokens
"this", "is", and "a" will probably be seen in lots of emails, both ham and
spam.  If you register either of those emails first, though, the shared
tokens will make the classification of the other one wrong.  With a high
min_dev and robs, those
tokens will still be in the min_dev range, and the whole email will classify
at robx, which should be unsure.  This is better than a misclassification.
Assuming you then see the token "spam" in lots of spams, and "ham" in lots
of hams, those two tokens will later contribute correctly to classifications
whereas the tokens "this", "is", and "a" (assuming they were tokens that
contained enough characters) would probably remain within the non-scoring
min_dev range, even if they occasionally sit on one side or the other of
0.5.
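
The "this is a spam" / "this is a ham" case can be sketched as a toy
classifier.  Everything here is an assumption made for illustration: the
function names, whitespace tokenizing with no minimum token length, equal
corpus sizes, and a plain average standing in for bogofilter's real
combining step.  The only behaviour taken from the description above is that
tokens inside the min_dev range are ignored and that a message with no
scoring tokens lands at robx.

```python
from collections import Counter


def token_score(spam_count, ham_count, robs, robx):
    """Simplified Robinson smoothing (assumes equal-sized corpora)."""
    n = spam_count + ham_count
    if n == 0:
        return robx
    p = spam_count / n
    return (robs * robx + n * p) / (robs + n)


def classify(text, spam_counts, ham_counts, robs, robx, min_dev):
    """Average the tokens outside the min_dev band; robx if none remain."""
    scores = []
    for tok in set(text.split()):
        f = token_score(spam_counts[tok], ham_counts[tok], robs, robx)
        if abs(f - 0.5) >= min_dev:
            scores.append(f)
    if not scores:
        return robx  # nothing scored: the message stays "unsure"
    return sum(scores) / len(scores)  # stand-in for the real combiner


# Register a single spam, then classify the near-identical ham.
spam_counts = Counter("this is a spam".split())
ham_counts = Counter()

hasty = classify("this is a ham", spam_counts, ham_counts,
                 robs=0.01, robx=0.45, min_dev=0.1)
cautious = classify("this is a ham", spam_counts, ham_counts,
                    robs=1.0, robx=0.45, min_dev=0.35)
print(hasty, cautious)
```

With the small robs and min_dev, the shared tokens drag the ham toward spam
after one registration; with the larger values, nothing scores yet and the
message sits at robx.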

Does that make any more sense?

Tom



