Is bogotune helpful?

Tue Dec 2 13:50:42 CET 2003

Hello Bill,

Good questions about bogotune!  Hopefully I can give you some more info.

To answer the basic question, bogotune _is_ useful.  Using the
parameters it recommended (robs, robx, min_dev, spam_cutoff, and
ham_cutoff), 99% of my spam and ham are correctly classified as spam
(or ham) and the remaining 1% of messages go into the "unsure" group.
It's a very rare occurrence when there's a false positive or a false
negative.  My false negatives are often messages like "hire your
offshore programmers here" rather than the offensive garbage that
makes up most of the spam received.
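
To make the spam/ham/unsure split concrete, here's a minimal Python
sketch of how two cutoffs turn a score into a verdict.  It is an
illustration, not bogofilter's actual code: the function name
classify() and the cutoff values 0.95 and 0.10 are my own assumptions,
as is the rule that a score at or above spam_cutoff is spam and a score
at or below ham_cutoff is ham.

```python
def classify(score, spam_cutoff, ham_cutoff):
    """Turn a message score in [0, 1] into one of three verdicts.

    Hypothetical helper: the comparison rule is an assumption made
    for illustration, not taken from bogofilter's source.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    # Anything between the two cutoffs is left for a human to decide.
    return "unsure"


# Illustrative cutoffs only; bogotune would recommend its own values.
SPAM_CUTOFF = 0.95
HAM_CUTOFF = 0.10

for score in (0.99, 0.50, 0.02):
    print(score, classify(score, SPAM_CUTOFF, HAM_CUTOFF))
```

Moving the two cutoffs closer together shrinks the "unsure" group at
the cost of more false positives and false negatives; picking that
trade-off is what the tuning is for.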

Bogotune, as a perl script, has been distributed with bogofilter for
quite a while.  Recently it's been converted to C, which enables it to
run much, much faster.  Since bogotune does multiple passes over the
data (often referred to as the "test messages") and can't do its job
unless there are a significant number of messages, tuning is a very
CPU-intensive task.

At this point in time (Tuesday, 2 December 2003), bogotune is undergoing
a tuning, tweaking, and polishing process.  Greg and I are testing it
with several largish test sets (approx 40,000 messages divided between
spam and non-spam) to determine the best parameters for bogotune itself.

Having proper parameters for bogotune is important because problems
occur in the tuning process when cutoff values are too large (above
0.97) or too small (0.5 or below).  With the perl version, polishing
wasn't feasible because it's just too dang slow!  With Cbogotune (as
Greg and I refer to it), there's enough speed that the polishing process
becomes feasible.

When tuning, the quality of the data set (the non-spam and spam message
sets) is very important.  Incorrectly classified messages (ham in the
spam set or spam in the ham set) can muck up the results.  So, after
reading the wordlist and the messages, bogotune does a scoring pass over
the spam and non-spam data sets and checks how many aberrant results
(very low scoring spam and very high scoring ham) are present.  If there
are more than a small percentage, bogotune will complain and quit.  When
this happens, check your data to make sure it's all classified correctly
(because bogotune judges it is not).  If your data _is_ correct and
bogotune still complains (and quits), you can use the force option, i.e.
"-F".  Rather than using -F, though, it's probably best to remove the
messages causing the complaint.
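
The sanity check described above can be sketched roughly as follows.
This is my own simplified Python illustration, not the code in
bogotune: the function names, the 0.10/0.90 thresholds for "very low"
and "very high" scores, and the 1% limit are all assumed values chosen
for the example.

```python
def aberrant_fraction(spam_scores, ham_scores, low=0.10, high=0.90):
    """Fraction of messages whose score contradicts their label.

    low/high are hypothetical thresholds: spam scoring below `low`
    and ham scoring above `high` count as aberrant.
    """
    low_spam = sum(1 for s in spam_scores if s < low)
    high_ham = sum(1 for s in ham_scores if s > high)
    total = len(spam_scores) + len(ham_scores)
    return (low_spam + high_ham) / total if total else 0.0


def may_proceed(spam_scores, ham_scores, max_fraction=0.01, force=False):
    """Return True if tuning should go ahead.

    `force` plays the role of the -F option: it skips the complaint
    but does nothing to fix the underlying data.
    """
    if force:
        return True
    return aberrant_fraction(spam_scores, ham_scores) <= max_fraction


# 10 of 200 messages look misfiled, so the check refuses to continue.
spam = [0.99] * 90 + [0.02] * 10
ham = [0.01] * 100
print(may_proceed(spam, ham))
print(may_proceed(spam, ham, force=True))
```

Note that forcing only silences the check; the misfiled messages still
skew the tuning, which is why removing them is the better fix.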

As to data set size, bogotune _can_ run with as few as 500 each of spam
and non-spam.  It will be much happier if you have at least several
thousand of each, and even more is better!  It's not _necessary_ to
tune with messages that are not already in the wordlist, but it's the
right thing to do.  Remember, bogotune is using info from messages past
(as stored in the wordlist) to create parameters to catch spam in
messages from the future (that have not yet arrived).

Bogotune can be run without a wordlist using the "-D" (no database)
option.  It reads in all the messages, splits them in half, uses the
first half to build a wordlist (in RAM), and uses the second half for
tuning.  This process is reasonably memory efficient, but too many
messages and too little RAM can still cause problems.
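
As a rough picture of what the "-D" approach amounts to, here's a short
Python sketch: split each message set in half, build an in-memory
wordlist from the first halves, and keep the second halves for tuning.
It is not bogotune's implementation.  The function names are
hypothetical, and whitespace splitting is a stand-in for bogofilter's
real tokenizer.

```python
from collections import Counter


def split_in_half(messages):
    """First half trains the wordlist, second half is held out."""
    mid = len(messages) // 2
    return messages[:mid], messages[mid:]


def build_wordlist(spam_msgs, ham_msgs):
    """Count, per token, how many spam and ham messages contain it.

    Assumption: tokens are whitespace-separated words, and each token
    is counted once per message.
    """
    wordlist = {"spam": Counter(), "ham": Counter()}
    for label, msgs in (("spam", spam_msgs), ("ham", ham_msgs)):
        for msg in msgs:
            wordlist[label].update(set(msg.split()))
    return wordlist


spam = ["buy now", "buy cheap", "cheap pills", "act now"]
ham = ["hello bill", "lunch today", "see you", "patch attached"]

spam_train, spam_tune = split_in_half(spam)
ham_train, ham_tune = split_in_half(ham)
wordlist = build_wordlist(spam_train, ham_train)
print(wordlist["spam"]["buy"], len(spam_tune), len(ham_tune))
```

Because the tuning halves never went into the wordlist, they behave
like "messages from the future", which is the point of the split.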

Since Cbogotune is still being tweaked, I'd recommend waiting 'til the
bogofilter-0.15.10 release.

Hope this helps!

David

P.S.  If you're up for compiling source code and being a beta tester,
let me know.