spamicity values and switches

Wed Apr 30 18:31:07 CEST 2003

Joerg,

When I have a question about why bogofilter is classifying a message 
in a particular way, the first thing I do is generate a histogram using 
switches "-F -vv".   Below are _my_ histograms for bogofilter-faq.html (as 
included in the 0.12.2 RPM).

Here's the histogram using '-r' (the Robinson geometric mean algorithm):

[relson@osage cvs]$ bogofilter -C -r -F -vv < doc/bogofilter-faq.html
X-Bogosity: No, tests=bogofilter, spamicity=0.268002, version=0.12.2
    int  cnt   prob  spamicity histogram
   0.00  226 0.018153 0.007146 ################################################
   0.10   65 0.160957 0.025498 ##############
   0.20   61 0.249396 0.051598 #############
   0.30   69 0.355610 0.091949 ###############
   0.40   58 0.439720 0.130749 #############
   0.50   58 0.544373 0.175591 #############
   0.60   25 0.660243 0.198646 ######
   0.70   21 0.747427 0.221001 #####
   0.80   18 0.846808 0.244567 ####
   0.90   14 0.930313 0.268002 ###

Here's the histogram using '-f' (the Robinson Fisher algorithm):

[relson@osage cvs]$ bogofilter -C -f -F -vv < doc/bogofilter-faq.html
X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.12.2
    int  cnt   prob  spamicity histogram
   0.00  226 0.018153 0.008576 ################################################
   0.10   65 0.160957 0.030530 ##############
   0.20   61 0.249396 0.061397 #############
   0.30   69 0.355610 0.108162 ###############
   0.40    0 0.000000 0.108162
   0.50    0 0.000000 0.108162
   0.60   25 0.660243 0.144058 ######
   0.70   21 0.747427 0.177642 #####
   0.80   18 0.846808 0.211795 ####
   0.90   14 0.930313 0.244536 ###

Be aware that -r (the Robinson geometric mean algorithm) and -f (the 
Robinson Fisher algorithm) give different _numeric_ values but tend to give 
the same Yes/No classification.  The Fisher modification to the Robinson 
algorithm adds a chi-square calculation that produces a "confidence level" 
indicating bogofilter's "sureness" of a right answer.  Briefly, if there are 
a lot of tokens contributing to the score and the score is "different 
enough" from 0.5, the Fisher score will be 0.00000 or 1.00000.
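
To make the difference concrete, here is a rough Python sketch of the two 
combining rules as I understand them from Robinson's published description. 
It is not bogofilter's code: the function names are made up for 
illustration, every token in a bucket is assumed to have exactly the "prob" 
value listed for that bucket in the '-r' histogram, and no tokens are 
excluded. Because of those assumptions it will not reproduce the numbers 
bogofilter prints, but it should show the geometric mean landing somewhere 
between 0 and 0.5 while the Fisher score collapses toward 0.

```python
import math

# (count, prob) per bucket, copied from the '-r' histogram above.
# ASSUMPTION: every token in a bucket has exactly the listed prob.
BUCKETS = [
    (226, 0.018153), (65, 0.160957), (61, 0.249396), (69, 0.355610),
    (58, 0.439720), (58, 0.544373), (25, 0.660243), (21, 0.747427),
    (18, 0.846808), (14, 0.930313),
]
PROBS = [p for count, p in BUCKETS for _ in range(count)]


def geometric_mean_score(probs):
    """Robinson geometric mean: compare spam-side and ham-side evidence."""
    n = len(probs)
    p_spam = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    q_ham = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    return (1.0 + (p_spam - q_ham) / (p_spam + q_ham)) / 2.0


def chi2_tail(x, df):
    """P(chi-square with even df exceeds x), summed in log space so that
    exp() does not underflow when there are many tokens."""
    m = x / 2.0
    if m <= 0.0:
        return 1.0
    logs = [-m + k * math.log(m) - math.lgamma(k + 1) for k in range(df // 2)]
    top = max(logs)
    return min(1.0, math.exp(top) * sum(math.exp(v - top) for v in logs))


def fisher_score(probs):
    """Fisher combining: two chi-square tests with 2n degrees of freedom."""
    n = len(probs)
    ham = 1.0 - chi2_tail(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    spam = 1.0 - chi2_tail(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + spam - ham) / 2.0


if __name__ == "__main__":
    print("geometric mean:", geometric_mean_score(PROBS))
    print("fisher:        ", fisher_score(PROBS))
```

With only a handful of tokens, or with tokens all near 0.5, both functions 
stay close to 0.5; it is the large token count that pushes the Fisher value 
to the extreme.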

You might find it informative to look at the histogram for 
bogofilter-faq.html.  As you can see, with my wordlists the message has 
many more low scoring (hammish) tokens than it has high scoring (spammish) 
tokens.

I can, perhaps, take a look at the message you showed in your posting.  To 
do so, I need to have you update to 0.12.2 and convert the message to the 
msg-count format (using script training/bogolex.sh).  The msg-count format 
will give me the actual numbers from your wordlists for the tokens in the 
message.  It'd be good if you'd also include the msg-count format for 
bogofilter-faq.html as I can compare that easily with what I have.

Hope this helps :-)

David