Maintaining a snappy bogofilter
David Relson
relson at osagesoftware.com
Fri Apr 11 17:33:16 CEST 2003
At 11:07 AM 4/11/03, Mark Constable wrote:
>On Fri, 11 Apr 2003 11:45 pm, David Relson wrote:
> > ...
> > make a noticeable difference. My wordlists contain everything back to Oct
> > 6 when I put bogofilter into production. I have rebuilt the wordlists
> > several times. The 0.7/0.8 database format change necessitated one of the
> > changes. I also rebuilt sometime after switching from Graham to Robinson
> > (since they use different MAX_REPEATS values).
>
>Could you perhaps spare a sentence or two on how effective your setup
>appears to be for you ? (from someone who knows how to tweak bogofilter)

At the moment I have bogofilter-0.11.1.8 running on my mail server. It's
the executable from the 0.11.1.8 i86.rpm, i.e. the same release that is on
SourceForge, not a development version.

The biggest difference in my setup is that I have ham_cutoff=0.1 in
/etc/bogofilter.cf. This tells bogofilter to run in tri-state mode. My
mail is thus classified as Ham/Spam/Unsure. Currently 200-300 messages per
day come in with 40% or so being spam. Everything that bogofilter
classifies as spam _is_ spam. About 5-10% of each day's messages are
classified as unsure.
The unsure messages have spam scores that pretty much cover the range
between ham_cutoff and spam_cutoff. Occasionally there is a spam message
that scores between 0.1 and 0.2 and occasionally there is a ham message up
above 0.90. These occasions are infrequent, but they _do_ occur. This
range of scores also means that I can't increase ham_cutoff or decrease
spam_cutoff without risk of increased numbers of false positives and/or
false negatives.
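
As a rough illustration of the tri-state decision described above, here is a minimal Python sketch. It is not bogofilter's actual code: the function name `classify` is hypothetical, the defaults are the ham_cutoff=0.1 and spam_cutoff=0.95 values mentioned in this message, and the handling of scores exactly on a cutoff is an assumption.

```python
def classify(score, ham_cutoff=0.1, spam_cutoff=0.95):
    """Map a spam score in [0, 1] to Ham, Spam, or Unsure.

    Assumption: a score exactly on a cutoff counts as ham or spam
    respectively; bogofilter's own boundary handling may differ.
    """
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    # Everything strictly between the two cutoffs is left for a human.
    return "Unsure"


if __name__ == "__main__":
    for score in (0.02, 0.15, 0.90, 0.99):
        print(score, classify(score))
```

With these cutoffs, the spam scoring 0.1-0.2 and the ham scoring above 0.90 both land in Unsure, which is why moving either cutoff inward would turn some of them into misclassifications.
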
I also use "-u" to autoupdate the wordlists when incoming messages are
classified as spam or as ham. I manually classify the messages that
bogofilter is unsure about and let a cronjob feed them into the wordlists.
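
The cronjob step could look something like the following Python sketch. It assumes the manually sorted messages are stored one per file, and that they are registered with bogofilter's `-s` (register as spam) and `-n` (register as non-spam) options, reading each message on standard input. The helper names and directory paths are hypothetical, not the actual script in use.

```python
import subprocess
from pathlib import Path

# bogofilter registration flags: -s registers spam, -n registers non-spam.
FLAGS = {"spam": "-s", "ham": "-n"}


def register_command(label):
    """Build the bogofilter command line for a manually classified message."""
    return ["bogofilter", FLAGS[label]]


def feed_directory(directory, label):
    """Pipe every file in `directory` into bogofilter; return the count fed."""
    count = 0
    for path in sorted(Path(directory).glob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as message:
            # The message is supplied on stdin, one file per invocation.
            subprocess.run(register_command(label), stdin=message, check=True)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical directories holding the hand-sorted "unsure" messages.
    feed_directory("/var/spool/unsure/spam", "spam")
    feed_directory("/var/spool/unsure/ham", "ham")
```
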
At the moment I don't mind the quantity of unsures received each day. Some
experiments indicate that I could increase my min_dev from 0.1 to 0.2 or
0.3 and also increase my spam_cutoff from 0.95 to 0.98 or 0.99, but I
haven't done that.
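
To show what min_dev does, here is a small Python sketch: tokens whose spam probability lies within min_dev of the neutral value 0.5 are left out of the score. The function name, the sample tokens, and their probabilities are invented for illustration, and whether a token exactly at the boundary is kept is an assumption.

```python
def significant_tokens(token_probs, min_dev=0.1):
    """Keep only tokens whose probability deviates from 0.5 by min_dev or more."""
    return {
        token: prob
        for token, prob in token_probs.items()
        if abs(prob - 0.5) >= min_dev
    }


if __name__ == "__main__":
    # Made-up per-token spam probabilities, for illustration only.
    sample = {"meeting": 0.02, "the": 0.45, "click": 0.75, "viagra": 0.99}
    print(sorted(significant_tokens(sample, min_dev=0.1)))
    print(sorted(significant_tokens(sample, min_dev=0.3)))
```

Raising min_dev from 0.1 to 0.3 drops the moderately spammy token as well, so only strongly indicative words contribute to the score.
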
>ie; I get about 100 mpd with approx 95% being spam and an average of two
>spams per day in my regular mail and the odd HTML markedup genuine message
>in amongst my spam. Still using 0.8 for maybe 6 months with zero tweaking.
I haven't provided any exact figures. I'm satisfied with the performance,
so why give figures? Also, I know that the numbers and settings _do_ vary
from site to site.

I would recommend updating from 0.8. The Robinson-Fisher algorithm _does_
do a better job than the older Graham algorithm, which is what 0.8 uses.
Also, bogofilter now understands multipart MIME messages, decodes base64,
quoted-printable, and uuencoded text, and does some useful processing of
HTML. It also has some speed improvements. All in all, there have been a
_lot_ of changes since 0.8.
David