Robinson algorithm experiences

Wed Nov 27 05:40:03 CET 2002

At 11:25 PM 11/26/02, Shane Wegner wrote:

>Hi,
>
>I just upgraded to BogoFilter 0.9.0, which makes the Robinson
>algorithm the default.  I rebuilt the spam and
>nonspam databases, and though the amount of spam caught
>did increase, so did the false positives.  In particular, the
>original Graham algorithm was pretty good at recognizing a
>nonspam email with an html attachment, whereas the Robinson
>algorithm seems to classify any html email as spam.  Much of
>my spam is indeed html email, but Robinson must give the
>html entities more weight.
>
>Regards,
>Shane

Shane,

bogofilter's lexer recognizes the most common html tags and "eats" them, so
the spam calculation never sees those tags.

Graham computes the spam index for each token in the message, picks the 15 
tokens with values furthest from 0.500, then computes the message's spam 
index from those 15 tokens alone.

Robinson uses the same tokenizer and gets exactly the same tokens as Graham, 
computes the spam index for each one, then computes the message's spam index 
based on _all_ the tokens.
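
As a rough illustration of that difference, here is a minimal Python sketch.
The combining formulas are assumptions taken from the published descriptions
of the two methods (Graham's "A Plan for Spam" and Robinson's geometric-mean
proposal), not from the bogofilter source, and the function names and token
values are made up for the example.

```python
import math

def graham_score(probs, keep=15):
    """Graham-style score (assumed formula): combine only the `keep`
    token values that lie furthest from the neutral 0.5."""
    extreme = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = math.prod(extreme)
    ham = math.prod(1.0 - p for p in extreme)
    return spam / (spam + ham)

def robinson_score(probs):
    """Robinson-style score (assumed geometric-mean formula): every
    token value contributes. `probs` must not be empty."""
    n = len(probs)
    p = 1.0 - math.prod(1.0 - x for x in probs) ** (1.0 / n)
    q = 1.0 - math.prod(probs) ** (1.0 / n)
    return (1.0 + (p - q) / (p + q)) / 2.0

# Hypothetical message: 5 strongly nonspam tokens, 50 mildly spammy ones.
tokens = [0.01] * 5 + [0.7] * 50
print("Graham:  ", graham_score(tokens))
print("Robinson:", robinson_score(tokens))
```

With these made-up values the Graham score is close to 0 while the Robinson
score comes out above 0.5, because the many mildly spammy tokens that Graham
ignores all count in Robinson's result.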

If you want to learn more about what's going on, run bogofilter with 
varying levels of debugging output, controlled by repeating the '-v' 
(verbose) command line option.  Some informative combinations are:

Graham:    bogofilter -g -vv <message   -- prints the 15 tokens and their info
Robinson:  bogofilter -r -vv <message   -- prints a histogram of the tokens evaluated
           bogofilter -r -vvv <message  -- prints _all_ the tokens evaluated and their info

If you're curious about wordlist values for some tokens, run bogoutil and 
use the -p and -w options, as in

         bogoutil -p -w BOGOFILTER_DIR first_token second third

Hope this helps.

David