spam scores [was: fd]
David Relson
relson at osagesoftware.com
Thu Mar 13 13:33:52 CET 2003
At 04:37 AM 3/13/03, Daniel Lublin wrote:
>Since version 0.11.1.2 (or possibly 0.11.1.1) of bogofilter (I use the
>Debian package), I get a spamicity value of 1.000000 for the emails
>that are classified as spam. What is the matter with this? Could it
>have something to do with most (possibly all so far) of these
>particular spams being already marked up as spam by SpamAssassin?
>
>//Daniel
Daniel,
You're seeing the effect of the chi-square test in the Robinson-Fisher
algorithm, which is the new default algorithm. It is unrelated to
SpamAssassin's markup (except possibly for a small effect).
Previously, the default algorithm was the Robinson-GM (geometric mean)
method. Roughly speaking, this method computes spam and ham scores for
each unique token in the message, computes two cumulative products (one
each for ham and spam), averages the two products, and ends up with a score
between 0.0 and 1.0.
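As a rough illustration of the geometric-mean idea, here is a minimal Python sketch. It follows the formulation Gary Robinson published rather than bogofilter's actual source, and the function name robinson_gm and the per-token probabilities passed to it are assumptions made up for the example.

```python
import math

def robinson_gm(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    into one score using geometric means. Illustrative sketch only."""
    n = len(probs)
    # Work with logs so long products do not underflow.
    mean_log_spam = sum(math.log(p) for p in probs) / n
    mean_log_ham = sum(math.log(1.0 - p) for p in probs) / n
    spamminess = 1.0 - math.exp(mean_log_ham)   # 1 - geometric mean of (1 - p)
    hamminess = 1.0 - math.exp(mean_log_spam)   # 1 - geometric mean of p
    # Balance the two indicators and map the result onto 0.0 .. 1.0.
    return (1.0 + (spamminess - hamminess) / (spamminess + hamminess)) / 2.0

print(robinson_gm([0.9, 0.9, 0.9]))   # mostly spammy tokens
print(robinson_gm([0.1, 0.2, 0.1]))   # mostly hammy tokens
```

Scores produced this way tend to sit well inside the 0.0 to 1.0 range instead of piling up at the extremes.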
With the change to Robinson-Fisher, a chi-square test is added at the end
of the Robinson-GM scoring. The additional test uses the number of unique
tokens and the Robinson-GM score and determines the likelihood the message
is ham or spam. Again this gives a score between 0.0 and 1.0. The
difference is that the new ham scores are much closer to zero and the new
spam scores are much closer to one. In scientific notation, ham scores
often have values like 1.33e-06, 1.44e-12, etc., and spam scores are often
within a similar distance (1.33e-06, etc.) of one. If you want to see the additional
detail, add the following line (explained in bogofilter.cf.example) to the
end of your bogofilter.cf file:
spamicity_formats=%0.6e, %0.6e
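To show why a six-decimal display ends up at 1.000000, here is a rough Python sketch of Fisher-style chi-square combining. It uses the inverse chi-square formulation Gary Robinson published, working from the per-token probabilities, and is not taken from bogofilter's source. The names chi2q and fisher_score and the sample token probabilities are assumptions for illustration. Python's % operator accepts the same printf-style conversions (%f, %e) that appear in the format line above.

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-square probability for an even number of
    degrees of freedom. Illustrative sketch only."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)
    into one score between 0.0 and 1.0."""
    n = len(probs)
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + spam_ind - ham_ind) / 2.0

spam = fisher_score([0.99] * 20)   # hypothetical message full of spammy tokens
ham = fisher_score([0.01] * 20)    # hypothetical message full of hammy tokens

print("%0.6f" % spam)   # fixed six decimals
print("%0.6f" % ham)    # fixed six decimals
print("%0.6e" % ham)    # scientific notation
```

With many strongly spammy or hammy tokens, the chi-square step pushes the score so close to one or zero that a fixed six-decimal format rounds away the difference.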
Hope this helps :-)
David