Questions about spamicity
David Relson
relson at osagesoftware.com
Fri May 30 00:56:07 CEST 2003
At 06:25 PM 5/29/03, Michael Rensing wrote:
>Does it make sense for a message to have spamicity=0.000000? That's
>what's getting put into my message headers. As in:
>
>X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.13.2.1
>
>It seems to me that for a statistical method, there should virtually
>never be a perfect 0 or 1 for a rating. However, that's what I'm getting
>for all of my messages.
Michael,
0.000000 is perfectly fine. The calculation is done in several phases.
First, each word is looked up in goodlist.db and spamlist.db to get its
non-spam and spam scores, from which the word's spam index is computed.
For any given word, this can range from 0 to 1 (inclusive). The exact
value depends on the word's usage in good and spam messages. The Robinson
formula uses a value (robx, default of 0.415) for words not previously
seen and the robs value (default 0.01) for words seen before. Read the
Robinson article referenced in the FAQ for details.
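As a rough illustration of the first phase, here is a minimal Python
sketch of Robinson's published f(w) formula, in which robs acts as the
weight given to the robx prior. The function name robinson_f, its
arguments, and the way counts are normalized by message totals are
assumptions made for this example, not bogofilter's actual code.

```python
def robinson_f(spam_count, good_count, spam_msgs, good_msgs,
               robx=0.415, robs=0.01):
    """Per-word spam index following Robinson's f(w) formula (sketch)."""
    n = spam_count + good_count
    if n == 0:
        # Word never seen before: fall back to the robx prior.
        return robx
    # Frequency of the word per message in each corpus (assumed normalization).
    spam_freq = spam_count / max(spam_msgs, 1)
    good_freq = good_count / max(good_msgs, 1)
    p = spam_freq / (spam_freq + good_freq)
    # Blend the observed ratio p with the prior robx, weighted by robs.
    return (robs * robx + n * p) / (robs + n)


print(robinson_f(0, 0, 100, 100))    # unseen word
print(robinson_f(10, 0, 100, 100))   # seen only in spam
print(robinson_f(0, 10, 100, 100))   # seen only in ham
```

With a small robs, a word seen even a few times lands very close to its
observed ratio, which is why words seen only in ham push the message score
so strongly toward 0.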
Second, using all words that differ from EVEN_ODDS, i.e. 1/2, by more than
min_dev (default is 0.1), compute the spam score for the message. This
step is the Bayesian part of the calculation. The result is a
number between 0 and 1, but only rarely is it exactly 0 or 1.
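The min_dev selection can be sketched as follows. The helper name
significant and the sample values are made up for illustration; only
EVEN_ODDS and min_dev come from the description above.

```python
EVEN_ODDS = 0.5


def significant(probs, min_dev=0.1):
    """Keep only word probabilities further than min_dev from EVEN_ODDS."""
    return [p for p in probs if abs(p - EVEN_ODDS) > min_dev]


word_probs = [0.5, 0.55, 0.45, 0.9, 0.05, 0.61]
print(significant(word_probs))
```

Words close to 1/2 say little either way, so dropping them leaves only the
tokens that carry real evidence for ham or spam.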
Third, apply the Fisher algorithm, which does a chi-square test using the
result of step 2 and the number of tokens. This gives a confidence
measure, i.e. it tells how likely the result is to be ham or spam (given
the score and the token count). Again the value is rarely exactly 0 or 1,
but it can be very close, such as 1e-16, which prints as 0.000000 when
shown with six decimal places. Bogofilter can print out the
difference (from 0 for ham and from 1 for spam) in scientific
notation. If you want to have the values printed this way, look at the
formatting section of /etc/bogofilter.cf.
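To see why the printed value can be 0.000000 without being exactly zero,
here is a minimal Python sketch of the chi-square combining in the form
Robinson published it. The names chi2q and fisher_spamicity are
hypothetical, the input values are invented, and bogofilter's own
implementation may differ in detail.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of a chi-square value; dof must be even."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1, so clamp it.
    return min(total, 1.0)


def fisher_spamicity(probs):
    """Combine per-word probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_spam = sum(math.log(p) for p in probs)
    ln_ham = sum(math.log(1.0 - p) for p in probs)
    spam_side = chi2q(-2.0 * ln_spam, 2 * n)  # near 0 for hammy tokens
    ham_side = chi2q(-2.0 * ln_ham, 2 * n)    # near 0 for spammy tokens
    return (1.0 + spam_side - ham_side) / 2.0


hammy = fisher_spamicity([0.05] * 30)    # 30 strongly hammy tokens
spammy = fisher_spamicity([0.95] * 30)   # 30 strongly spammy tokens
print("spamicity=%.6f" % hammy)          # six decimals hide the tiny value
print("distance from 0: %.2e" % hammy)
print("distance from 1: %.2e" % (1.0 - spammy))
```

With a few dozen tokens that all lean the same way, the combined score is
far smaller than 0.0000005, so a six-decimal display rounds it to zero
even though the underlying number is not exactly 0.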
>When I run bogofilter -M -v against my spam mailbox, only a few have a
>non-zero spamicity. Any ideas what's going on? Do I need to reset the
>database somehow? If so, how?
You might want to look at the calculation results in greater depth. When
run with "-vvv", bogofilter displays the tokens along with their counts
and probabilities. If you find words that have scores different from what
you'd expect, then you may have incorrectly registered some spam messages
as ham (or vice versa). If that's so, you'll have to carefully sort your
saved emails into ham and spam and create new wordlists.
David