false positive

Mon Jan 20 22:06:51 CET 2003

Hello Barry,

I can't (won't?) answer all your questions, but I can tackle a few of 'em.

Bogofilter is more interested in ratios, i.e. how common a token is in a 
list (relative to the number of messages in that list), than in 
absolutes.  So it doesn't directly use any of 4, 12, 8237, or 31396.  They 
are normalized and combined to get the token's spam probability.

I plugged the numbers into gnumeric, using the equations below (with 
originals being in graham.c and bogoutil.c):

	spamness = spam_count/spam_msg_count = 4/8237 = 0.000486
	goodness = good_count/good_msg_count = 12/31396 = 0.000382
	prob = spamness / (spamness + goodness)
	     = 0.000486 / (0.000486 + 0.000382) = 0.5595740
	       (computed from the unrounded ratios)
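
For anyone who would rather check the arithmetic in code than in a 
spreadsheet, here is a minimal Python sketch of the same three 
equations.  The function and parameter names are my own, chosen to match 
the equations above; they are not the identifiers used in graham.c or 
bogoutil.c.

```python
def token_spam_prob(spam_count, spam_msg_count, good_count, good_msg_count):
    """Token spam probability from per-list counts (names are illustrative)."""
    # Normalize each count by the number of messages in its list.
    spamness = spam_count / spam_msg_count
    goodness = good_count / good_msg_count
    # Share of the combined normalized frequency that comes from spam.
    return spamness / (spamness + goodness)


prob = token_spam_prob(4, 8237, 12, 31396)
print(f"{prob:.6f}")
```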

Thought of differently, the token's spam count is 1/3 of its good count (4 
vs 12), while the number of spams is 26% of the number of good messages 
(8237 vs 31396).  Since 1/3 and .26 are not far apart, it's reasonable 
that the probability is near 1/2.

Regarding the 0.99, you have to look at how the token scores are combined 
to compute the message score.  Roughly, it's:

	cum_prod = 1
	cum_inv_prod = 1
	for count = 1 to 15
		cum_prod = cum_prod * token_prob[count]
		cum_inv_prod = cum_inv_prod * (1 - token_prob[count])

	prob = cum_prod / (cum_prod + cum_inv_prod)

When working with all 0.01's and 0.99's, as your example does, one 
cumulative product quickly becomes vanishingly small compared to the 
other, so the ratio is pushed to one of the limits (very low or very 
high).  If you use "-vvv", I think you'll be able to see those 
products.  So the result of 0.99 seems reasonable to me.
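
As a rough illustration, here is a small Python sketch of that 
combining step.  The token list is made up (I don't have your actual 
15 scores), and the function name is mine, not bogofilter's.  It shows 
that a single extra 0.99 among otherwise balanced extremes is enough to 
land on 0.99.

```python
def graham_score(token_probs):
    """Combine token probabilities with the two cumulative products."""
    cum_prod = 1.0
    cum_inv_prod = 1.0
    for p in token_probs:
        cum_prod *= p              # product of the token probabilities
        cum_inv_prod *= 1.0 - p    # product of their complements
    return cum_prod / (cum_prod + cum_inv_prod)


# Hypothetical message: eight tokens at 0.99 and seven at 0.01.
# Seven of each cancel out, leaving one 0.99 to decide the score.
example_tokens = [0.99] * 8 + [0.01] * 7
score = graham_score(example_tokens)
print(f"{score:.4f}")
```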

To go off on a bit of a tangent, the Graham calculation only uses part of 
the message, specifically the 15 tokens with scores furthest from 
0.5.  Given the same set of words, but a different ordering, the score 
could be 0.0 (based on 15 tokens with scores of 0.01) or could be 1.0 
(based on 15 tokens with scores of 0.99).
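
A quick way to see the ordering problem is the Python sketch below.  It 
assumes that ties in distance from 0.5 are broken by position in the 
message (first seen wins); that tie-breaking rule is my assumption for 
the sake of the example, not a description of what graham.c does, and 
the token lists are invented.

```python
def graham_score(token_probs):
    """Combine token probabilities with the two cumulative products."""
    cum_prod = 1.0
    cum_inv_prod = 1.0
    for p in token_probs:
        cum_prod *= p
        cum_inv_prod *= 1.0 - p
    return cum_prod / (cum_prod + cum_inv_prod)


def most_extreme(token_probs, keep=15):
    """Pick the `keep` tokens furthest from 0.5 (ties: earliest first)."""
    # sorted() is stable, so equally extreme tokens keep message order.
    # Rounding the key avoids spurious differences from float noise.
    return sorted(token_probs,
                  key=lambda p: round(abs(p - 0.5), 9),
                  reverse=True)[:keep]


# Same 30 tokens, two different orderings.
ham_first = [0.01] * 15 + [0.99] * 15
spam_first = [0.99] * 15 + [0.01] * 15

ham_first_score = graham_score(most_extreme(ham_first))
spam_first_score = graham_score(most_extreme(spam_first))
print(ham_first_score, spam_first_score)
```

Both lists contain exactly the same tokens, yet the first scores 
essentially 0 and the second essentially 1.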

This is _not_good_.  It's the primary reason that release 0.9.1.2 switched 
from the Graham algorithm to the Robinson algorithm.  Robinson looks at all 
words in the message, so it gives a more meaningful score.  Robinson also 
has a parameter called min_dev (the minimum deviation from even odds, 
i.e. 0.5) that can be used to ignore words close to even odds (otherwise 
known as neutral words).
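
For comparison, here is a Python sketch of a Robinson-style score, 
written from my understanding of Gary Robinson's geometric-mean 
proposal.  I have not checked it line by line against bogofilter's 
source, so treat the formula, the function name, and the handling of an 
empty token list (return 0.5) as assumptions.  It uses every token 
outside the min_dev band, so the order of the tokens no longer matters.

```python
import math


def robinson_score(token_probs, min_dev=0.0):
    """Geometric-mean score over all tokens at least min_dev from 0.5.

    Token probabilities must be strictly between 0 and 1.
    """
    used = [p for p in token_probs if abs(p - 0.5) >= min_dev]
    if not used:
        return 0.5  # assumption: no usable evidence means even odds
    n = len(used)
    # P ("spamminess") and Q ("non-spamminess"), computed via logs so
    # long messages don't underflow.
    big_p = 1.0 - math.exp(sum(math.log(1.0 - p) for p in used) / n)
    big_q = 1.0 - math.exp(sum(math.log(p) for p in used) / n)
    # Map (P - Q) / (P + Q) from [-1, 1] onto [0, 1].
    return (1.0 + (big_p - big_q) / (big_p + big_q)) / 2.0


balanced = [0.01] * 15 + [0.99] * 15
print(robinson_score(balanced), robinson_score(balanced[::-1]))
print(robinson_score([0.5, 0.52, 0.9], min_dev=0.1))
```

With 15 tokens at 0.01 and 15 at 0.99, either ordering comes out at 0.5, 
which is a fairer summary of a message that is evenly split.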

Then, too, there's the Robinson-Fisher algorithm.  It takes the Robinson 
score and the number of tokens that went into the computation, does some 
mathematical magic (also called a chi-square test), and determines whether 
the Robinson score _really_ indicates spam, _really_ indicates ham, or 
whether no accurate call is possible.  Using parameters named spam_cutoff 
and ham_cutoff, high values indicate spam, low values indicate ham, and 
mid-range values indicate that the result is indeterminate.  Stated more 
succinctly, Robinson-Fisher allows a ternary classification of 
spam/ham/unsure.
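
The three-way decision itself is simple once you have a score.  Here is 
a minimal Python sketch; the cutoff values are invented for 
illustration (they are not bogofilter's defaults), and whether the 
comparisons are inclusive at the boundaries is my assumption.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Ternary classification from a score in the range 0..1."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


# Hypothetical cutoffs, chosen only to show the three outcomes.
HAM_CUTOFF = 0.10
SPAM_CUTOFF = 0.95

for s in (0.02, 0.50, 0.99):
    print(s, classify(s, HAM_CUTOFF, SPAM_CUTOFF))
```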

Why do I mention all this?  Because you really ought to consider updating 
to newer code and a newer algorithm.  0.9.1.2 is the current stable version 
and 0.10.0 was released as the current (beta) version.  Check them out!

David