Maintaining a snappy bogofilter

David Relson relson at osagesoftware.com
Fri Apr 11 17:33:16 CEST 2003


At 11:07 AM 4/11/03, Mark Constable wrote:

>On Fri, 11 Apr 2003 11:45 pm, David Relson wrote:
> > ...
> > make a noticeable difference.  My wordlists contain everything back to Oct
> > 6 when I put bogofilter into production.  I have rebuilt the wordlists
> > several times.  The 0.7/0.8 database format change necessitated one of the
> > changes.  I also rebuilt sometime after switching from Graham to Robinson
> > (since they use different MAX_REPEATS values).
>
>Could you perhaps spare a sentence or two on how effective your setup
>appears to be for you ? (from someone who knows how to tweak bogofilter)

At the moment I have bogofilter-0.11.1.8 running on my mail server.  It's 
the executable from the 0.11.1.8 i86.rpm, not a development version, i.e. 
it's the same as on SourceForge.

The biggest difference in my setup is that I have ham_cutoff=0.1 in 
/etc/bogofilter.cf.  This tells bogofilter to run in tri-state mode.  My 
mail is thus classified as Ham/Spam/Unsure.  Currently 200-300 messages per 
day come in with 40% or so being spam.  Everything that bogofilter 
classifies as spam _is_ spam.  About 5-10% of each day's messages are 
classified as unsure.

The unsure messages have spam scores that pretty much cover the range 
between ham_cutoff and spam_cutoff.  Occasionally there is a spam message 
that scores between 0.1 and 0.2 and occasionally there is a ham message up 
above 0.90.  These occasions are infrequent, but they _do_ occur.  This 
spread of scores also means that I can't increase ham_cutoff without 
risking more false negatives, or decrease spam_cutoff without risking more 
false positives.
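
A rough illustration of the tri-state mode described above: the sketch below
maps a spam score to Ham/Spam/Unsure using ham_cutoff=0.1 and
spam_cutoff=0.95, the values mentioned in this message.  The function name
`classify` and the handling of scores that land exactly on a cutoff are
assumptions for illustration, not bogofilter's actual code.

```python
HAM_CUTOFF = 0.1    # ham_cutoff value quoted in this message
SPAM_CUTOFF = 0.95  # spam_cutoff value quoted in this message


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Hypothetical helper: map a spam score in [0, 1] to a tri-state label."""
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    # Everything between the two cutoffs is left for manual classification.
    return "Unsure"


for score in (0.02, 0.15, 0.90, 0.99):
    print(score, classify(score))
```

Under these assumptions, a spam scoring 0.15 and a ham scoring 0.90 both
land in Unsure, which is why moving either cutoff inward would turn such
messages into misclassifications.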

I also use "-u" to autoupdate the wordlists when incoming messages are 
classified as spam or as ham.  I manually classify the messages that 
bogofilter is unsure about and let a cronjob feed them into the wordlists.

At the moment I don't mind the quantity of unsures received each day.  Some 
experiments indicate that I could increase my min_dev from 0.1 to 0.2 or 
0.3 and also increase my spam_cutoff from 0.95 to 0.98 or 0.99, but I 
haven't done that.
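
A minimal sketch of what raising min_dev would do, assuming min_dev acts as
a minimum deviation from the neutral value 0.5, so that tokens closer to 0.5
than min_dev are left out of the score.  The helper name and the token
probabilities are made up for illustration and are not taken from
bogofilter or from real wordlists.

```python
def significant_tokens(token_probs, min_dev):
    """Hypothetical helper: keep tokens at least min_dev away from 0.5."""
    return {tok: p for tok, p in token_probs.items() if abs(p - 0.5) >= min_dev}


# Made-up per-token spam probabilities, for illustration only.
probs = {"lottery": 0.99, "meeting": 0.05, "free": 0.72, "the": 0.52, "click": 0.35}

for min_dev in (0.1, 0.2, 0.3):
    kept = significant_tokens(probs, min_dev)
    print(min_dev, sorted(kept))
```

A larger min_dev leaves fewer, more strongly indicative tokens to contribute
to the score.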

>ie; I get about 100 mpd with approx 95% being spam and an average of two
>spams per day in my regular mail and the odd HTML markedup genuine message
>in amongst my spam. Still using 0.8 for maybe 6 months with zero tweaking.

I haven't provided any exact figures.  I'm satisfied with the performance, 
so why give figures?  Also, I know that the numbers and settings _do_ vary 
from site to site.

I would recommend updating from 0.8.  The Robinson-Fisher algorithm _does_ 
do a better job than the older Graham algorithm which is in 0.8.  Also, 
bogofilter now understands multipart MIME messages, decodes base64, 
quoted-printable, and uuencoded text, and does some useful processing of 
HTML.  It also has some speed improvements.  All in all, there have been a 
_lot_ of changes since 0.8.

David




