spam scores [was: fd]

David Relson relson at osagesoftware.com
Thu Mar 13 13:33:52 CET 2003


At 04:37 AM 3/13/03, Daniel Lublin wrote:


>Since version 0.11.1.2 (or possibly 0.11.1.1) of bogofilter (I use the
>Debian package), I get a spamicity value of 1.000000 for the emails
>that are classified as spam. What is the matter with this? Could it
>have something to do with most (possibly all so far) of these
>particular spams already being marked up as spam by SpamAssassin?
>
>//Daniel

Daniel,

You're seeing the effect of the chi-square test in the Robinson-Fisher 
algorithm, which is the new default algorithm.  It isn't related to
SpamAssassin's markup (except possibly for a small effect).

Previously, the default algorithm was the Robinson-GM (geometric mean) 
method.  Roughly speaking, this method computes spam and ham scores for 
each unique token in the message, computes two cumulative products (one 
each for ham and spam), averages the two products, and ends up with a score 
between 0.0 and 1.0.
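
As a rough illustration of the steps just described, here is a minimal
Python sketch of a geometric-mean combination in the style Gary Robinson
published.  It is not bogofilter's actual code: the function name
robinson_gm_score and the list of per-token spam probabilities are
assumptions made for the example, and each probability is assumed to lie
strictly between 0 and 1.

```python
import math

def robinson_gm_score(token_probs):
    """Combine per-token spam probabilities with a geometric-mean scheme.

    Hypothetical helper for illustration only; every probability must be
    strictly between 0 and 1.
    """
    n = len(token_probs)
    if n == 0:
        return 0.5  # no evidence either way

    # Sums of logs stand in for the two cumulative products (avoids underflow).
    log_spam_product = sum(math.log(p) for p in token_probs)
    log_ham_product = sum(math.log(1.0 - p) for p in token_probs)

    # Taking the n-th root of each product gives a geometric mean.
    spam_indicator = 1.0 - math.exp(log_ham_product / n)
    ham_indicator = 1.0 - math.exp(log_spam_product / n)

    # Fold the two indicators into one value in [-1, 1], then rescale to [0, 1].
    s = (spam_indicator - ham_indicator) / (spam_indicator + ham_indicator)
    return (1.0 + s) / 2.0

print(robinson_gm_score([0.9, 0.9, 0.9]))  # approximately 0.9
```

Note that a message whose tokens all score 0.9 ends up near 0.9 under this
scheme, not near 1.0; the result stays in the middle of the range.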

With the change to Robinson-Fisher, a chi-square test is added at the end 
of the Robinson-GM scoring.  The additional test uses the number of unique 
tokens and the Robinson-GM score and determines the likelihood the message 
is ham or spam.  Again this gives a score between 0.0 and 1.0.  The
difference is that the new ham scores are much closer to zero and the new
spam scores are much closer to one.  In scientific notation, ham scores
often have values like 1.33e-06, 1.44e-12, etc., and spam scores are often
within a similarly small distance (1.33e-06, etc.) of one, which a
six-decimal display rounds to 1.000000.  If you want to see the additional
detail, add the following line (explained in bogofilter.cf.example) to the
end of your bogofilter.cf file:

spamicity_formats=%0.6e, %0.6e
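
To show why the scores pile up at the extremes, here is a minimal Python
sketch of the commonly published Fisher/chi-square combination.  It is an
assumption-laden illustration, not bogofilter's implementation: it works
from the per-token probabilities directly, and the names chi2q and
fisher_score, along with the sample token probabilities, are made up for
the example.

```python
import math

def chi2q(x2, df):
    """Upper-tail chi-square probability; df must be even (illustrative helper)."""
    m = x2 / 2.0
    term = total = math.exp(-m)  # underflows to 0.0 for very large x2
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def fisher_score(token_probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(token_probs)
    if n == 0:
        return 0.5
    # Near 1 when the tokens look spammy, near 0 when they look hammy.
    spam_side = chi2q(-2.0 * sum(math.log(p) for p in token_probs), 2 * n)
    # Near 1 when the tokens look hammy, near 0 when they look spammy.
    ham_side = chi2q(-2.0 * sum(math.log(1.0 - p) for p in token_probs), 2 * n)
    return ((1.0 - ham_side) + spam_side) / 2.0

spam = fisher_score([0.99] * 20)  # twenty strongly spammy tokens
ham = fisher_score([0.01] * 20)   # twenty strongly hammy tokens

print("%0.6f  %0.6f" % (spam, ham))  # fixed-point: extremes round to 1 and 0
print("%0.6e  %0.6e" % (spam, ham))  # scientific notation keeps tiny values visible
```

With six fixed decimals, both scores round to the ends of the range, which
matches the 1.000000 reported in the question.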

Hope this helps :-)

David




