false positive
David Relson
relson at osagesoftware.com
Mon Jan 20 22:06:51 CET 2003
Hello Barry,
I can't (won't?) answer all your questions, but I can tackle a few of 'em.
Bogofilter is more interested in ratios, i.e. how common is a token in a
list (as a ratio to the number of messages in the list), than
absolutes. So it doesn't directly use any of 4, 12, 8237, or 31396. They
are normalized and manipulated to get the token's spam probability.
I plugged the numbers into gnumeric, using the equations below (with
originals being in graham.c and bogoutil.c):
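spamness = spam_count / spam_msg_count = 4 / 8237 = 0.000486
goodness = good_count / good_msg_count = 12 / 31396 = 0.000382
prob = spamness / (spamness + goodness) = 0.000486 / (0.000486 + 0.000382) = 0.559574
(The final figure comes from the unrounded ratios; the six-digit values shown are rounded for display.)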
Thought of differently, the token's spam count is 1/3 of its good count (4
vs 12), while the number of spams is 26% of the number of good messages (8237
vs 31396). Since 1/3 and 0.26 are not too far apart, it's reasonable
that the probability is near 1/2.
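For anyone who would rather check the arithmetic in code than in a spreadsheet, here is a minimal Python sketch of the three equations above. The function name and argument names are mine, chosen to mirror the equations; this is not a transcription of graham.c or bogoutil.c, and it leaves out any clamping or special cases the real code may apply.

```python
def token_spam_probability(spam_count, spam_msg_count, good_count, good_msg_count):
    """Normalize the raw counts by list size, then take the spam share."""
    spamness = spam_count / spam_msg_count   # how common the token is in spam
    goodness = good_count / good_msg_count   # how common the token is in ham
    return spamness / (spamness + goodness)

# The numbers from the question: 4 of 8237 spams, 12 of 31396 good messages.
prob = token_spam_probability(4, 8237, 12, 31396)
print(f"{prob:.6f}")  # 0.559574
```

Note that the sketch would divide by zero for a token that appears in neither list; a real implementation has to handle that case somehow.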
Regarding the 0.99, you have to look at how the token scores are combined
to compute the message score. Roughly, it's:
cum_prod = 1
cum_inv_prod = 1
for count = 1 to 15
    cum_prod = cum_prod * token_prob
    cum_inv_prod = cum_inv_prod * (1 - token_prob)
prob = cum_prod / (cum_prod + cum_inv_prod)
When working with all 0.01's and 0.99's, as your example does, the two
cumulative products rapidly move to the limits (very low and very
high). If you use "-vvv", I think you'll be able to see those
products. So, the result of 0.99 seems reasonable to me.
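To see how quickly the products run to the extremes, here is a small Python sketch of the combining loop above. The token lists are made up for illustration (they are not your actual tokens), and the function name is hypothetical.

```python
def graham_combine(token_probs):
    """Combine per-token spam probabilities the way the loop above does."""
    cum_prod = 1.0
    cum_inv_prod = 1.0
    for p in token_probs:
        cum_prod *= p              # product of the token probabilities
        cum_inv_prod *= 1.0 - p    # product of their complements
    return cum_prod / (cum_prod + cum_inv_prod)

# Illustrative input: eight tokens at 0.99 and seven at 0.01.
# The seven pairs cancel out, leaving the one extra 0.99 token in charge.
score = graham_combine([0.99] * 8 + [0.01] * 7)
print(round(score, 6))  # roughly 0.99

# Fifteen tokens at 0.99 push the score essentially to 1.0.
all_spammy = graham_combine([0.99] * 15)
```

With only 0.01 and 0.99 as inputs, the message score is decided entirely by which value has the majority, and by how much.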
To go off on a bit of a tangent, the Graham calculation only uses part of
the message, specifically the 15 tokens with scores furthest from
0.5. Given the same set of words, but a different ordering, the score
could be 0.0 (based on 15 tokens with scores of 0.01) or could be 1.0
(based on 15 tokens with scores of 0.99).
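The ordering problem can be reproduced with a short Python sketch. One loud assumption: ties between equally extreme tokens are broken here by position in the message (Python's sort is stable), which may or may not match what the real Graham code does. The names most_extreme and graham_combine are mine.

```python
def graham_combine(token_probs):
    """Graham-style combination of per-token spam probabilities."""
    cum_prod = 1.0
    cum_inv_prod = 1.0
    for p in token_probs:
        cum_prod *= p
        cum_inv_prod *= 1.0 - p
    return cum_prod / (cum_prod + cum_inv_prod)

def most_extreme(token_probs, keep=15):
    """Keep the `keep` tokens furthest from 0.5.

    sorted() is stable, so tokens with equal deviation stay in message
    order (an assumption about tie-breaking, made for illustration).
    Rounding the key guards against floating-point noise in the ties.
    """
    ranked = sorted(token_probs, key=lambda p: round(abs(p - 0.5), 6), reverse=True)
    return ranked[:keep]

# Same 30 tokens, two different orderings.
hammy_first = [0.01] * 15 + [0.99] * 15
spammy_first = [0.99] * 15 + [0.01] * 15

score_a = graham_combine(most_extreme(hammy_first))   # close to 0.0
score_b = graham_combine(most_extreme(spammy_first))  # close to 1.0
print(score_a, score_b)
```

Both messages contain exactly the same evidence, yet they land at opposite ends of the scale.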
This is _not_good_. It's the primary reason that release 0.9.1.2 switched
from the Graham algorithm to the Robinson algorithm. Robinson looks at all
words in the message, so it gives a more meaningful score. Robinson also has
a parameter, called min_dev (referring to minimum deviation from even odds
(0.5)) that can be used to ignore words close to even odds (otherwise known
as neutral words).
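For comparison, here is a Python sketch of a Robinson-style score with a min_dev filter. It follows the geometric-mean formulation from Gary Robinson's published description as I understand it; it is not copied from bogofilter's source, the min_dev value of 0.1 is only an illustrative choice, and returning 0.5 when every token is filtered out is my own assumption.

```python
import math

def robinson_score(token_probs, min_dev=0.1):
    """Robinson-style geometric-mean score over all non-neutral tokens.

    Each probability must lie strictly between 0 and 1.
    """
    # Drop neutral words: those closer than min_dev to even odds.
    used = [p for p in token_probs if abs(p - 0.5) >= min_dev]
    if not used:
        return 0.5  # assumption: no evidence either way
    n = len(used)
    # Geometric means computed via logs to avoid underflow on long messages.
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in used) / n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in used) / n)
    S = (P - Q) / (P + Q)
    return (1.0 + S) / 2.0   # rescale from [-1, 1] to [0, 1]

# Fifteen very hammy and fifteen very spammy tokens, in either order,
# come out near 0.5 because every token is counted.
mixed = robinson_score([0.01] * 15 + [0.99] * 15)
print(round(mixed, 6))  # 0.5
```

Unlike the 15-token selection, the result here does not depend on the order of the words in the message.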
Then, too, there's the Robinson-Fisher algorithm. It takes the Robinson
score and the number of tokens that went into the computation, does some
mathematical magic (also called a chi-square test), and determines whether
the Robinson score _really_ indicates spam, _really_ indicates ham, or
whether no accurate call is possible. Using parameters named spam_cutoff
and ham_cutoff, high values indicate spam, low values indicate ham, and
mid-range values indicate that the result is indeterminate. Stated more
succinctly, the Robinson-Fisher allows a ternary classification of
spam/ham/unsure.
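Here is a rough Python sketch of the ternary idea. It uses the textbook Fisher combination (a chi-square test on the summed logs of the token probabilities) as commonly described for Robinson-Fisher filters, which may differ in detail from what bogofilter actually computes. The parameter names spam_cutoff and ham_cutoff are the ones mentioned above, but the values 0.95 and 0.10 are purely illustrative, and the function names are mine.

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-square probability; dof must be even.

    For very large x2, exp() underflows to 0.0, which is fine for a sketch.
    """
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(token_probs):
    """Combine token probabilities (each strictly between 0 and 1)."""
    n = len(token_probs)
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in token_probs), 2 * n)
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in token_probs), 2 * n)
    return (1.0 + spam_ind - ham_ind) / 2.0

def classify(score, spam_cutoff=0.95, ham_cutoff=0.10):
    """Ternary call; the cutoff values are illustrative assumptions."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

print(classify(fisher_score([0.99] * 15)))                # spam
print(classify(fisher_score([0.01] * 15)))                # ham
print(classify(fisher_score([0.99] * 15 + [0.01] * 15)))  # unsure
```

The third case is the interesting one: strong evidence in both directions gives a mid-range score, so the message is reported as unsure rather than forced into spam or ham.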
Why do I mention all this? Because you really ought to consider updating
to newer code and a newer algorithm. 0.9.1.2 is the current stable version
and 0.10.0 was released as the current (beta) version. Check them out!
David