Questions about spamicity

Fri May 30 00:56:07 CEST 2003

At 06:25 PM 5/29/03, Michael Rensing wrote:
>Does it make sense for a message to have spamicity=0.000000? That's
>what's getting put into my message headers. As in:
>
>X-Bogosity:  No, tests=bogofilter, spamicity=0.000000, version=0.13.2.1
>
>It seems to me that for a statistical method, there should virtually
>never be a perfect 0 or 1 for a rating. However, that's what I'm getting
>for all of my messages.

Michael,

0.000000 is perfectly fine.  The calculation is done in three phases.

First, look up each word in goodlist.db and spamlist.db to get its
non-spam and spam counts, and compute the word's spam index.  For any given
word, this can range from 0 to 1 (inclusive).  The exact value depends on
the word's usage in good and spam messages.  The Robinson formula uses
robx (default 0.415) as the value for words not previously seen, and robs
(default 0.01) as the weight given to robx when it is blended with the
observed counts of words seen before.  Read the Robinson article
referenced in the FAQ for details.
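
As a rough illustration of this first phase, here is a minimal Python sketch of Robinson's smoothing formula, f(w) = (s*x + n*p(w)) / (s + n).  It is not bogofilter's actual code.  The function name, its parameters, and the way the counts are scaled by the number of registered messages are assumptions made for the example; only the robx and robs defaults come from the text above.

```python
ROBX = 0.415  # value assumed for a word never seen before (default quoted above)
ROBS = 0.01   # weight given to ROBX relative to the observed counts

def word_spam_index(spam_count, good_count, spam_msgs, good_msgs,
                    robx=ROBX, robs=ROBS):
    """Hypothetical helper: Robinson-smoothed spam index of one word.

    spam_count / good_count: how often the word was seen in spam / ham.
    spam_msgs / good_msgs: registered message totals (assumed > 0).
    """
    n = spam_count + good_count
    if n == 0:
        return robx  # unseen word: fall back to robx
    spam_freq = spam_count / spam_msgs
    good_freq = good_count / good_msgs
    p = spam_freq / (spam_freq + good_freq)   # raw estimate from the counts
    return (robs * robx + n * p) / (robs + n)  # pull p toward robx

print(word_spam_index(0, 0, 100, 100))   # unseen word
print(word_spam_index(10, 0, 100, 100))  # word seen only in spam
print(word_spam_index(0, 10, 100, 100))  # word seen only in ham
```

With a small robs, a word seen even a handful of times is dominated by its own counts, which is why heavily one-sided words end up very close to 0 or 1.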

Second, compute the spam score for the message using only those words whose
spam index differs from EVEN_ODDS, i.e. 1/2, by more than min_dev (default
is 0.1).  This step is the Bayesian part of the calculation.  The result is
a number between 0 and 1, but only rarely is it exactly 0 or 1.
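
The word selection in this second phase can be sketched as follows.  This is only an illustration of the min_dev cutoff described above, not bogofilter's implementation; the function name is made up for the example.

```python
EVEN_ODDS = 0.5  # a word that says nothing either way
MIN_DEV = 0.1    # default cutoff quoted above

def significant_scores(scores, min_dev=MIN_DEV):
    """Hypothetical helper: keep only the word scores that differ
    from EVEN_ODDS by more than min_dev."""
    return [f for f in scores if abs(f - EVEN_ODDS) > min_dev]

# 0.45, 0.55 and the robx default 0.415 are too close to 1/2 and are dropped.
print(significant_scores([0.01, 0.45, 0.55, 0.99, 0.415]))
```

Note that with these defaults a never-seen word (score 0.415) falls inside the excluded band, so it does not influence the message score.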

Third, apply the Fisher algorithm, which does a chi-square test using the
result of step 2 and the number of tokens.  This gives a confidence
measure, i.e. tells how likely the result is to be ham or spam (given the
score and the token count).  Again the value is rarely exactly 0 or 1, but
it can be extremely close, e.g. 1e-16 away.  A value like that is displayed
as 0.000000 when printed with six decimal places.  Bogofilter can print out
the difference (from 0 for ham and from 1 for spam) in scientific
notation.  If you want to have the values printed this way, look at the
formatting section of /etc/bogofilter.cf
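
To see why so many messages show 0.000000, here is a minimal Python sketch of Fisher-style chi-square combining as described in Robinson's article.  It is an approximation for illustration, not bogofilter's code: the function names are hypothetical, the word scores are assumed to lie strictly between 0 and 1, and returning 0.5 for an empty list is my own choice.

```python
import math

def chi2q(x2, df):
    """Upper-tail probability of chi-square value x2; df must be even."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def fisher_spamicity(scores):
    """Hypothetical helper: combine word scores (each strictly in (0, 1))."""
    n = len(scores)
    if n == 0:
        return 0.5  # assumption: no evidence means undecided
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - f) for f in scores), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(f) for f in scores), 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0

hammy = fisher_spamicity([0.01] * 50)  # fifty strongly hammy words
print("%.6f" % hammy)  # six decimal places, as in the X-Bogosity header
print("%.2e" % hammy)  # scientific notation shows what is left, if anything
```

With a few dozen strongly one-sided words, the combined value lands so close to 0 (or to 1 for spam) that a fixed six-digit format cannot show the difference.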

>When I run bogofilter -M -v against my spam mailbox, only a few have a
>non-zero spamicity. Any ideas what's going on? Do I need to reset the
>database somehow? If so, how?

You might want to look at the calculation results in greater depth.
Running bogofilter with "-vvv" displays the tokens along with their counts
and probabilities.  If you find words that have scores different from what
you'd expect, then you may have incorrectly registered some spam messages
as ham (or vice versa).  If that's so, you'll have to carefully sort your
saved emails into ham and spam and create new wordlists.

David