tuning bogofilter (was: bogofilter producing poor results)

Greg Louis glouis at dynamicro.on.ca
Tue Nov 12 21:58:36 CET 2002


On 20021112 (Tue) at 0732:37 -0800, William Ono wrote:
>
> So, if I leave the magic values alone, from the volume of email that I
> receive it looks as though I should see better performance after
> feeding in a total of about two years' worth of email.  Hmm.  I think
> I'd best go re-read the Robinson paper with a pot of coffee and see what
> I remember from my (very few) statistics and probabilities courses, and
> get to tuning those magic values.

This is likely to become a FAQ, so here's a bit of an explanation that
I hope may help you and others interested in tuning Gary Robinson's
f(w) and S calculation.  Much of the substance is straight out of
Gary's paper, but I've tried to emphasize the practical effects:

Don't expect a miraculous improvement from the magic ;)  With Robinson's
changes as implemented in bogofilter, there are four things to tune:

SPAM_CUTOFF is the threshold of "spamicity" above which a message is
deemed spam.  You move it up or down till you have the balance you
want between false positives and false negatives.  As the training set
grows and discrimination improves, you can edge SPAM_CUTOFF up a bit,
since we usually want the absolute minimum of false positives
compatible with not getting a flood of false negatives.
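
To make the decision concrete, here's a minimal Python sketch.  The
combining step is my reading of Gary's geometric-mean S calculation
from the paper; the names (spamicity, fws) and the 0.54 cutoff are
just illustrative, not bogofilter's internals:

  from math import prod

  def spamicity(fws):
      # Combine the per-token f(w) values: P and Q are one minus the
      # geometric means of (1 - f(w)) and f(w) respectively, and S
      # maps their contrast into [0, 1].
      n = len(fws)
      P = 1 - prod(1 - f for f in fws) ** (1 / n)
      Q = 1 - prod(fws) ** (1 / n)
      return (1 + (P - Q) / (P + Q)) / 2

  SPAM_CUTOFF = 0.54                 # illustrative value; tune to taste
  fws = [0.99, 0.02, 0.97]           # made-up f(w) values for a message
  is_spam = spamicity(fws) > SPAM_CUTOFF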

ROBS and ROBX (which Gary just calls s and x, and I will too to save
typing) work together.  A good starting value for x is the average of
  p(w) = badcount / (goodcount + badcount)
for every word in your training set that appears at least 10 times in
both bad and good wordlists (i.e. badcount >= 10 and goodcount >= 10).

Ok, I oversimplified: you have to scale the counts somehow.  If you had
exactly the same number of messages contributing to your good and bad
wordlists, the formula for p(w) given above would be ok; but we
actually have to use
  scalefactor = badlist_msgcount / goodlist_msgcount
  p(w) = badcount / (badcount + goodcount * scalefactor)
and average those p(w) values.
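
If you'd rather script that than eyeball it, a minimal sketch in
Python (the wordlists here are hypothetical dicts mapping token to
count, not bogofilter's database format):

  def robx(goodlist, badlist, goodlist_msgcount, badlist_msgcount):
      # Assumes at least one token qualifies on both sides.
      scalefactor = badlist_msgcount / goodlist_msgcount
      pws = [badcount / (badcount + goodlist[w] * scalefactor)
             for w, badcount in badlist.items()
             if badcount >= 10 and goodlist.get(w, 0) >= 10]
      return sum(pws) / len(pws)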

A good starting value for s is 0.001 (determined empirically), but you
might want to take the time to get a feel for what happens if you move
s up and down through a couple of orders of magnitude.  The value of x
is the f(w) value that a token will get if it has zero occurrences in
the training set.  The formula is
  f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
and you can see that if both counts are 0, you get x; if both counts
are small, x will have an influence on f(w) that varies with the
magnitude of s.
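
Or, in the same hypothetical Python terms:

  def f_w(badcount, goodcount, scalefactor, s=0.001, x=0.5):
      # With both counts zero this returns x exactly; large counts
      # swamp the s*x prior and it tends to the Graham p(w).
      return (s * x + badcount) / (s + badcount + goodcount * scalefactor)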

The thought behind this calculation is that x is really a "first guess"
at what the presence of an unknown token means in terms of
spammishness.  With counts of zero, we have only that guess to go on,
so that's what we use.  But with counts that are small though nonzero,
the Graham p(w) (defined above) is likely to be unreliable, so we ought
to (and f(w) does) compromise between our "first guess" and what the
counts say.  Once the counts get to a decent size, x becomes
insignificant and f(w) becomes, in effect, the Graham p(w).

So s and x are only important when the counts are small, and the value
of s reflects what we think of as "small."  If s is large, then when
counts are small we trust our x value more than we do the p(w); if s is
small, we give more weight to p(w) and less to x.  But obviously,
tuning s and x is going to be of little use when your training set is
large enough that your messages have few new or little-known tokens.
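
A quick worked example makes that concrete.  Take x = 0.5,
scalefactor = 1, and a token seen once in spam and never in good mail
(badcount = 1, goodcount = 0):

  with s = 1:      f(w) = (1 * 0.5 + 1) / (1 + 1)         = 0.75
  with s = 0.001:  f(w) = (0.001 * 0.5 + 1) / (0.001 + 1) ~= 0.9995

With the large s, a single sighting barely moves us off the first
guess; with s = 0.001, one occurrence already counts as near-certain
evidence of spam.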

MIN_DEV is the last of the four things you might want to play with.  By
default, it's zero; every token in the message contributes its
spammishness value f(w) to the final calculation.  We might save time
and even improve discrimination a bit if we ignore tokens with f(w)
values near 0.5, since those tokens obviously aren't making a great
difference to the outcome of the calculation.  MIN_DEV specifies how
different a value of f(w) has to be from 0.5 in order for that value to
be included in the calculation of S, the "spamicity" of the message.
I've played with MIN_DEV values of up to 0.4, where only words with f(w)
less than 0.1 or greater than 0.9 are taken into account.  In the end,
it seemed best to use all the words in the message; why throw away
information?  But ymmv.
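
The selection itself is a one-liner, in the same sketch style as
above (fws again being the f(w) values for the message's tokens):

  MIN_DEV = 0.0   # default: |f(w) - 0.5| >= 0 keeps every token
  kept = [f for f in fws if abs(f - 0.5) >= MIN_DEV]

With MIN_DEV = 0.4, only tokens whose f(w) falls outside the open
interval (0.1, 0.9) survive to the S calculation.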

Hope that helps more than it confuses... ;)
-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
