spamicity values and switches
David Relson
relson at osagesoftware.com
Wed Apr 30 18:31:07 CEST 2003
Joerg,
When I have a question about why bogofilter is classifying a message
in a particular way, the first thing I do is generate a histogram using
switches "-F -vv". Below are _my_ histograms for bogofilter-faq.html (as
included in the 0.12.2 RPM).
Here's the histogram using '-r' (the Robinson geometric mean algorithm):
[relson@osage cvs]$ bogofilter -C -r -F -vv < doc/bogofilter-faq.html
X-Bogosity: No, tests=bogofilter, spamicity=0.268002, version=0.12.2
 int  cnt      prob  spamicity  histogram
0.00  226  0.018153   0.007146  ################################################
0.10   65  0.160957   0.025498  ##############
0.20   61  0.249396   0.051598  #############
0.30   69  0.355610   0.091949  ###############
0.40   58  0.439720   0.130749  #############
0.50   58  0.544373   0.175591  #############
0.60   25  0.660243   0.198646  ######
0.70   21  0.747427   0.221001  #####
0.80   18  0.846808   0.244567  ####
0.90   14  0.930313   0.268002  ###
Here's the histogram using '-f' (the Robinson Fisher algorithm):
[relson@osage cvs]$ bogofilter -C -f -F -vv < doc/bogofilter-faq.html
X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.12.2
 int  cnt      prob  spamicity  histogram
0.00  226  0.018153   0.008576  ################################################
0.10   65  0.160957   0.030530  ##############
0.20   61  0.249396   0.061397  #############
0.30   69  0.355610   0.108162  ###############
0.40    0  0.000000   0.108162
0.50    0  0.000000   0.108162
0.60   25  0.660243   0.144058  ######
0.70   21  0.747427   0.177642  #####
0.80   18  0.846808   0.211795  ####
0.90   14  0.930313   0.244536  ###
Be aware that -r (the Robinson geometric mean algorithm) and -f (the
Robinson Fisher algorithm) give different _numeric_ values but tend to give
the same Yes/No classification. The Fisher modification to the Robinson
algorithm adds a chi-square calculation that produces a "confidence level"
indicating bogofilter's "sureness" of a right answer. Briefly, if there are a
lot of tokens contributing to the score and the score is "different enough"
from 0.5, the Fisher score will be 0.000000 or 1.000000.
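To make the difference between the two scores concrete, here is a minimal Python sketch of both combining rules as Gary Robinson described them. It is not taken from the bogofilter source. The function names (chi2_sf_even, geometric_score, fisher_score) and the sample token probabilities are made up for illustration, and every probability is assumed to lie strictly between 0 and 1.

```python
import math

def chi2_sf_even(x, dof):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    m = x / 2.0
    if m <= 0.0:
        return 1.0
    # Closed form for even dof: exp(-m) * sum_{i < dof/2} m**i / i!,
    # summed in log space so that large m does not underflow early.
    logs = [-m + i * math.log(m) - math.lgamma(i + 1) for i in range(dof // 2)]
    top = max(logs)
    return min(1.0, math.exp(top) * sum(math.exp(v - top) for v in logs))

def geometric_score(probs):
    """Robinson geometric mean: (1 + (P - Q) / (P + Q)) / 2."""
    n = len(probs)
    p = 1.0 - math.exp(sum(math.log(1.0 - f) for f in probs) / n)
    q = 1.0 - math.exp(sum(math.log(f) for f in probs) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0

def fisher_score(probs):
    """Robinson-Fisher: combine the token probabilities with the chi-square test."""
    n = len(probs)
    h = chi2_sf_even(-2.0 * sum(math.log(f) for f in probs), 2 * n)
    s = chi2_sf_even(-2.0 * sum(math.log(1.0 - f) for f in probs), 2 * n)
    return (1.0 + h - s) / 2.0

# Hypothetical message: many hammish tokens, a few spammish ones.
tokens = [0.02] * 50 + [0.9] * 5
print(f"geometric: {geometric_score(tokens):.6f}")
print(f"fisher:    {fisher_score(tokens):.6f}")
```

With a token mix like this, the geometric mean score lands somewhere below 0.5 while the Fisher score is pushed almost all the way to 0. That is the "different numbers, same Yes/No answer" behaviour described above.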
You might find it informative to look at the histogram for
bogofilter-faq.html. As you can see, with my wordlists the message has
many more low scoring (hammish) tokens than it has high scoring (spammish)
tokens.
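If you want to build the same kind of picture from a list of token probabilities yourself, the following Python sketch bins them into ten intervals of width 0.1. It assumes that "cnt" is the number of tokens in the interval and "prob" is their mean probability, which is my reading of the listing rather than something checked against the source. The bar scaling (rounding up, longest bar 48 characters) is chosen because it matches the bar lengths shown above. The function name histogram and the sample values are hypothetical.

```python
def histogram(probs, width=48):
    """Bin token probabilities into ten 0.1-wide intervals, one bar per interval."""
    counts = [0] * 10
    totals = [0.0] * 10
    for p in probs:
        b = min(int(p * 10), 9)  # p == 1.0 falls into the last interval
        counts[b] += 1
        totals[b] += p
    peak = max(counts) or 1  # avoid dividing by zero for an empty list
    rows = []
    for b in range(10):
        mean = totals[b] / counts[b] if counts[b] else 0.0
        bar_len = -(-counts[b] * width // peak)  # integer ceiling
        rows.append((b / 10.0, counts[b], mean, "#" * bar_len))
    return rows

# Hypothetical token probabilities, only to show the output shape.
for lo, cnt, mean, bar in histogram([0.01, 0.02, 0.03, 0.15, 0.95]):
    print(f"{lo:4.2f} {cnt:4d} {mean:8.6f} {bar}")
```

The running "spamicity" column is left out on purpose, since reproducing it would mean guessing how bogofilter accumulates the score across intervals.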
I can, perhaps, take a look at the message you showed in your posting. To
do so, I need to have you update to 0.12.2 and convert the message to the
msg-count format (using script training/bogolex.sh). The msg-count format
will give me the actual numbers from your wordlists for the tokens in the
message. It'd be good if you'd also include the msg-count format for
bogofilter-faq.html as I can compare that easily with what I have.
Hope this helps :-)
David