tuning bogofilter (was: bogofilter producing poor results)
David Relson
relson at osagesoftware.com
Thu Nov 14 13:39:01 CET 2002
At 01:07 AM 11/14/02, William Ono wrote:
>For your amusement, here are the values that I settled on:
>
>#define MAX_PROB 0.9999f // max probability value used
>#define MIN_PROB 0.0001f // min probability value used
>#define ROBINSON_MIN_DEV 0.15f // if nonzero, use characteristic words
>#define ROBINSON_SPAM_CUTOFF 0.54f // if it's spammier than this...
>#define ROBINSON_MAX_REPEATS 1 // cap on word frequency per message
>#define ROBS 0.01f // Robinson's s
>#define ROBX 0.415f // Robinson's x
>
>My ROBX turned out to be the same as yours, I think, but that's an
>honest coincidence. But maybe the default value of 0.200 needs to be
>cranked up a bit? I think changing this value had the greatest effect
>on the results, followed (strangely enough) by kicking up the MIN_DEV.
>Experimental results on a very small data set, so, as you said, YMMV.
>And of course, as you pointed out, this should have less impact on the
>results as the data set grows.
Hello William,
Glad to see you're making progress. Using different values _does_ have an
effect on the results produced. 'Tis good that you're doing the tests and
reporting your findings. I do have a couple of observations on the values
you're using and how they interact.
First, MAX_PROB and MIN_PROB only affect results when using the Graham
method and the various ROBxxx parameters are just for the Robinson
method. Since you're using the Robinson method, the MAX_PROB and MIN_PROB
experiments were extra work.
Second, my experiments indicate the major effect of ROBX is in assigning
spamicity to an unknown word, i.e. one not in the spam wordlist and not in
the ham wordlist. ROBINSON_MIN_DEV causes the code to ignore words close
to EVEN_ODDS, a.k.a. 0.50. With your values, (EVEN_ODDS - ROBX) is
0.50 - 0.415 = 0.085, which is less than your MIN_DEV of 0.15, so the result
is to ignore the unknown words. You can see this by using verbose mode to
display the histogram. Here's a test script for you to try:
#!/bin/sh
# Usage: test.sh message_file
( echo "min_dev=0.15" ; echo "robx=0.415") | tee .bogofilter.cf ;
./bogofilter -r -vv < "$1"
( echo "min_dev=0.0" ; echo "robx=0.415") | tee .bogofilter.cf ;
./bogofilter -r -vv < "$1"
( echo "min_dev=0.0" ; echo "robx=0.200") | tee .bogofilter.cf ;
./bogofilter -r -vv < "$1"
Run it as "test.sh message_file"
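If you'd like to check the arithmetic before running the script, here is a
rough Python sketch of what the three settings do to an unknown word. It is
not bogofilter's code. It assumes Robinson's published formula
f(w) = (s*x + n*p) / (s + n), and it assumes a word is skipped when it is
strictly closer to 0.50 than min_dev; the exact comparison in the source may
differ. The function names are made up for illustration.

```python
EVEN_ODDS = 0.5  # neutral spamicity


def robinson_f(p, n, robs, robx):
    """Robinson's f(w) = (s*x + n*p) / (s + n).

    p    -- raw spam probability of the word from the wordlists
    n    -- number of times the word has been seen (0 for an unknown word)
    robs -- Robinson's s, the weight given to the prior
    robx -- Robinson's x, the spamicity assumed for an unknown word
    """
    return (robs * robx + n * p) / (robs + n)


def is_ignored(f, min_dev):
    """Assumed rule: skip words whose score is within min_dev of EVEN_ODDS."""
    return abs(EVEN_ODDS - f) < min_dev


def unknown_word(min_dev, robx, robs=0.01):
    """Score a never-seen word (n = 0) and report whether it is skipped."""
    f = robinson_f(p=0.0, n=0, robs=robs, robx=robx)
    return f, is_ignored(f, min_dev)


if __name__ == "__main__":
    # The same three settings as the test script above.
    for min_dev, robx in [(0.15, 0.415), (0.0, 0.415), (0.0, 0.200)]:
        f, ignored = unknown_word(min_dev, robx)
        print(f"min_dev={min_dev:.2f} robx={robx:.3f} "
              f"f(unknown)={f:.3f} ignored={ignored}")
```

With n = 0 the formula reduces to x, so only the first setting (min_dev=0.15,
robx=0.415) drops unknown words; the other two let them count toward the
score.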
David
Note: You need to be using the newest available code from the cvs
repository for this test to work from a command line. bogofilter-0.8.0
doesn't have the ability to set robs and robx in the config file. If
you're using 0.8.0, you'll need to change the values of ROBS and ROBX in
the source code.