Is bogotune helpful?

Bill Wohler wohler at newt.com
Tue Dec 2 11:57:16 CET 2003


I just upgraded to bogofilter 0.15.8 and replaced my twin wordlists with
a single wordlist with a full rebuild. (I was happy to see the two 40 MB
lists shrink to a single 20 MB list.)

Once upon a time, I used the information in Greg's tuning paper to tweak
my values slightly. This time I thought bogotune would be able to do
this more easily. Instead of seeing suggested values of robx, robs,
min_dev, spam_cutoff, and ham_cutoff, I got the cryptic:

    Very few high-scoring nonspams in this data set.
    At target 26, cutoff is 0.000001.

What does this mean? 

I built the wordlist from 10000 ham and 10000 spam messages and used the
same input files for bogotune.

I had commented out my old settings in .bogofilter.cf except for:

    ham_cutoff = 0.25
    spam_cutoff = 0.51

[one google search later...]

Aha! I found Greg's bogotune README, which hasn't yet made it into the
Debian 0.15.8 release. It says that the testing messages should not be
in the training set. So allow me to suggest a few bug reports in this
message:

1. Update the bogotune man page to read like the README; that is, to say
   that the 500 testing messages must not be in the training set.

2. Replace the cryptic "At target 26, cutoff is 0.000001" with
   something more useful like "Testing set too similar to training set
   to draw any conclusions. Messages in testing set must not be in
   wordlist.db."

3. bogotune ignored my BOGOFILTER_DIR variable; I had to specify the -d
   option explicitly.

Now I just need to wait a few hours to get 500 spams for the testing set
and try again...
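
For anyone else in the same spot, here is a minimal Python sketch of
setting aside a testing set before building the wordlist, so that no
testing message is also a training message. It is only an illustration
of the idea in the README: split_holdout, test_size, and seed are
hypothetical names of my own, not part of bogofilter or bogotune, and
the message identifiers are stand-ins for real message files.

```python
import random


def split_holdout(messages, test_size=500, seed=0):
    """Return (training, testing) lists that share no message.

    Hypothetical helper for illustration; not part of bogofilter.
    """
    if test_size > len(messages):
        raise ValueError("not enough messages for the requested test set")
    shuffled = list(messages)               # copy; leave the input untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed: repeatable split
    return shuffled[test_size:], shuffled[:test_size]


# Stand-in identifiers; real use would pass paths to message files.
spam = [f"spam-{i:05d}" for i in range(10000)]
train, test = split_holdout(spam)

print(len(train), len(test))        # 9500 500
print(set(train).isdisjoint(test))  # True
```

The same split would be done separately for the ham. Only the training
half goes into the wordlist; the testing half is what gets handed to
bogotune.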

P.S. I was motivated to upgrade because of my spam_cutoff of 0.50: a
*ton* of spam scored .50something, and unfortunately a few false
positives came in at the .50something range too. I'm hoping the new
version plus the tuning will be able to separate the spam and ham a bit
better.

-- 
Bill Wohler <wohler at newt.com>  http://www.newt.com/wohler/  GnuPG ID:610BD9AD
Maintainer of comp.mail.mh FAQ and MH-E. Vote Libertarian!
If you're passed on the right, you're in the wrong lane.



