tuning bogofilter (was: bogofilter producing poor results)

David Relson relson at osagesoftware.com
Thu Nov 14 13:39:01 CET 2002


At 01:07 AM 11/14/02, William Ono wrote:

>For your amusement, here are the values that I settled on:
>
>#define MAX_PROB        0.9999f         // max probability value used
>#define MIN_PROB        0.0001f         // min probability value used
>#define ROBINSON_MIN_DEV        0.15f   // if nonzero, use characteristic words
>#define ROBINSON_SPAM_CUTOFF    0.54f   // if it's spammier than this...
>#define ROBINSON_MAX_REPEATS    1       // cap on word frequency per message
>#define ROBS                    0.01f   // Robinson's s
>#define ROBX                    0.415f  // Robinson's x
>
>My ROBX turned out to be the same as yours, I think, but that's an
>honest coincidence.  But maybe the default value of 0.200 needs to be
>cranked up a bit?  I think changing this value had the greatest effect
>on the results, followed (strangely enough) by kicking up the MIN_DEV.
>Experimental results on a very small data set, so, as you said, YMMV.
>And of course, as you pointed out, this should have less impact on the
>results as the data set grows.

Hello William,

Glad to see you're making progress.  Using different values _does_ have an 
effect on the results produced.  'Tis good that you're doing the tests and 
reporting your findings.  I do have a couple of observations on the values 
you're using and how they interact.

First, MAX_PROB and MIN_PROB only affect results when using the Graham 
method and the various ROBxxx parameters are just for the Robinson 
method.  Since you're using the Robinson method, the MAX_PROB and MIN_PROB 
experiments were extra work.

Second, my experiments indicate the major effect of ROBX is in assigning 
spamicity to an unknown word, i.e. one not in the spam wordlist and not in 
the ham wordlist.  ROBINSON_MIN_DEV causes the code to ignore words close 
to EVEN_ODDS, a.k.a. 0.50.  With your values, (EVEN_ODDS - ROBX) is 
0.50 - 0.415 = 0.085, which is less than MIN_DEV (0.15), so the result 
is to ignore the unknown words.  You can see this by using verbose mode to 
display the histogram.  Here's a test script for you to try:

         ( echo "min_dev=0.15" ; echo "robx=0.415") | tee .bogofilter.cf
         ./bogofilter -r -vv  < $1
         ( echo "min_dev=0.0"  ; echo "robx=0.415") | tee .bogofilter.cf
         ./bogofilter -r -vv  < $1
         ( echo "min_dev=0.0"  ; echo "robx=0.200") | tee .bogofilter.cf
         ./bogofilter -r -vv  < $1

Run it as "test.sh message_file"
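
If you want to check the arithmetic behind those three runs without 
bogofilter itself, here is a rough sketch.  It assumes Robinson's published 
formula f(w) = (s*x + n*p(w)) / (s + n), where n is the number of times the 
word has been seen, and it assumes a word is ignored when |f(w) - 0.5| is 
strictly less than min_dev.  The function names are made up for 
illustration, and bogofilter's actual code may differ in details such as 
how it treats the boundary.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not taken from the bogofilter source.

# robinson_fw S X N P
# Robinson's f(w) = (s*x + n*p) / (s + n) for a word seen N times
# with raw spam probability P.  For an unknown word N is 0, so f(w) = x.
robinson_fw() {
    awk -v s="$1" -v x="$2" -v n="$3" -v p="$4" \
        'BEGIN { printf "%.3f\n", (s * x + n * p) / (s + n) }'
}

# word_status FW MIN_DEV
# Prints "ignored" when |FW - 0.5| < MIN_DEV (assumed rule), else "used".
word_status() {
    awk -v f="$1" -v d="$2" 'BEGIN {
        dev = f - 0.5
        if (dev < 0) dev = -dev
        if (dev < d + 0) print "ignored"; else print "used"
    }'
}

# show_unknown_word MIN_DEV ROBX
# Scores an unknown word (n = 0) with robs = 0.01 and reports its status.
show_unknown_word() {
    fw=$(robinson_fw 0.01 "$2" 0 0)
    echo "min_dev=$1 robx=$2 f(w)=$fw $(word_status "$fw" "$1")"
}

show_unknown_word 0.15 0.415
show_unknown_word 0.0  0.415
show_unknown_word 0.0  0.200
```

Under those assumptions only the first combination drops the unknown words; 
in the other two they count toward the score, and robx=0.200 pushes them 
much further from 0.50 than robx=0.415 does.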

David

Note: You need to be using the newest available code from the CVS 
repository for this test to work from a command line. bogofilter-0.8.0 
doesn't have the ability to set robs and robx in the config file.  If 
you're using 0.8.0, you'll need to change the values of ROBS and ROBX in 
the source code.





