How to deal with extremely high spam levels

Tom Anderson tanderso at oac-design.com
Tue Jun 22 21:36:46 CEST 2004


From: "Bob Vincent" <bogofilter at bobvincent.org>
> Bogofilter is apparently designed for the situation where the number
> of spams per day roughly equals the number of non-spams per day.

Bob, I don't think that's true at all.  Perhaps that is true of bogotune,
but my ham:spam ratio is similar to yours, and bogofilter has been working
very well.  Each and every day I get zero false positives, 2-3 false
negatives (mostly viruses/bounces), and 3-4 spam unsures, with over 100
properly
filtered emails (mostly spam by about 30:1).  Manual tuning is a
trial-and-error process in which some familiarity with the theory behind
bogofilter helps immensely.  Basically it works like this:

1) start with a very high spam cutoff ~0.99, a substantial ham cutoff ~0.3,
a robx near but less than 0.5, a substantial robs ~0.2, and a min dev range
which encompasses your robx ~0.1 to ~0.2.  This setup will give you the
fewest false positives and a slight bias toward ham, and you should get
lots of unsures.
2) after you've received and registered a number of emails (using -u is
helpful as long as you make sure to promptly correct all misclassifications)
and you're comfortable that you're not getting any false positives, start
lowering your spam cutoff.  If you're not receiving any hams in your
unsures, you can start lowering your ham cutoff too.  You should probably
lower your ham cutoff to just a little above your highest scoring ham.
Lower your spam cutoff in increments of 0.1 to 0.5 as you gain confidence
that bogofilter will never classify a ham above that range.
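
As a rough sketch of step 2, assuming you have kept the scores bogofilter
gave to your known hams: the helper below is hypothetical (it is not part
of bogofilter), and the margin and step defaults are illustrative
assumptions, not recommended values.

```python
def suggest_cutoffs(ham_scores, spam_cutoff, margin=0.01, step=0.1):
    """Suggest new (ham_cutoff, spam_cutoff) from the scores of known hams.

    margin and step are illustrative assumptions, not bogofilter defaults.
    """
    highest_ham = max(ham_scores)
    # Ham cutoff: just a little above the highest-scoring ham.
    ham_cutoff = round(highest_ham + margin, 3)
    # Spam cutoff: lower it by one step, but keep it above the ham cutoff
    # so that no ham seen so far would land in the spam range.
    lowered = round(spam_cutoff - step, 3)
    new_spam_cutoff = max(lowered, round(ham_cutoff + margin, 3))
    # Never suggest raising the spam cutoff.
    new_spam_cutoff = min(new_spam_cutoff, spam_cutoff)
    return ham_cutoff, new_spam_cutoff


if __name__ == "__main__":
    print(suggest_cutoffs([0.02, 0.08, 0.05], spam_cutoff=0.99))
```

Run it again each time you have a fresh batch of correctly registered mail,
and only apply the suggestion if you have seen no false positives.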

My current values, after months of manual tweaking (that sounds like a lot,
but I only changed them 5-6 times), are as follows:
robx=0.46, robs=0.2, min_dev=0.2, spam_cutoff=0.465, ham_cutoff=0.1

You can play around with the robx, robs, and min_dev values after you've
read what they're for and understand thoroughly what they do.  In most
cases, I would think that keeping conservative values such as mine should
work well.
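
To illustrate how the three values interact, here is a minimal sketch based
on my understanding of Robinson's smoothing as bogofilter uses it; treat the
details as an assumption and check the bogofilter documentation before
relying on them.  Here p is a token's raw spam probability from your
wordlist counts and n is how many messages it has been seen in; the function
names are made up for this example.

```python
def token_score(p, n, robx=0.46, robs=0.2):
    """Smoothed token score f(w) = (robs*robx + n*p) / (robs + n).

    With n == 0 (an unknown token) the score is simply robx; as n grows,
    the score moves toward the observed probability p.  robs controls how
    much evidence is needed to pull the score away from robx.
    """
    return (robs * robx + n * p) / (robs + n)


def is_counted(score, min_dev=0.2):
    """A token contributes only if its score is at least min_dev from 0.5."""
    return abs(score - 0.5) >= min_dev


if __name__ == "__main__":
    for n in (0, 1, 10):
        s = token_score(0.99, n)
        print(n, round(s, 3), is_counted(s))
```

With robx=0.46 and min_dev=0.2, an unknown token scores 0.46, which is only
0.04 away from 0.5, so it is ignored.  That is what it means for the min dev
range to encompass your robx.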

I recommend you try out bfproxy
(http://orderamidchaos.com/bogofilter/bfproxy) for easily handling
registrations.  It works via a procmail recipe (or similar MDA setup) to
perform bogofilter functions via email.  You simply drag incorrect
classifications to a folder in your mail client, and then forward them all
as attachments in a single email to bfproxy occasionally.  Bfproxy will
also do recursive registrations, so that if after registering an email it
still doesn't classify correctly, it will register it again until it does.
This helps polarize your wordlist to your email more quickly.  Some will
argue that this could theoretically cause harm, but in practice it works
wonderfully.
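
The recursive registration idea can be sketched roughly as follows.  This is
not bfproxy's actual code: classify and register are hypothetical callables
standing in for whatever invokes bogofilter, and the max_rounds cap is my
own assumption to keep the loop from running forever.

```python
def register_until_correct(message, expected, classify, register,
                           max_rounds=5):
    """Register `message` as `expected` until classify() agrees.

    Returns (rounds, ok): how many registrations were performed and
    whether the message finally classified as expected.
    """
    for rounds in range(1, max_rounds + 1):
        register(message, expected)
        if classify(message) == expected:
            return rounds, True
    return max_rounds, False


if __name__ == "__main__":
    # Toy stand-ins: the message only classifies as spam once it has
    # been registered three times.
    seen = {"n": 0}

    def toy_register(msg, label):
        seen["n"] += 1

    def toy_classify(msg):
        return "spam" if seen["n"] >= 3 else "unsure"

    print(register_until_correct("msg", "spam", toy_classify, toy_register))
```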

You could also try out spamitarium
(http://orderamidchaos.com/bogofilter/spamitarium), which allows you to do
some preformatting of emails before running them through bogofilter.  This
includes stripping out nonstandard tags, validating received lines, and
adding ASNs to received lines.  This reduces the ambiguity that spammer
tricks introduce about which emails are ham or spam.  And the ASNs will make
whole regions of the internet more hammy or spammy (e.g., Nigeria) by
grouping IPs with their respective Autonomous Systems.
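
To show why grouping by ASN helps, here is a small sketch.  It is not how
spamitarium is implemented; the prefix table is hypothetical (documentation
address ranges and private-use AS numbers), and a real setup would need an
actual IP-to-ASN data source.

```python
import ipaddress

# Hypothetical prefix-to-ASN table, for illustration only.
PREFIX_TO_ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 64512,
    ipaddress.ip_network("198.51.100.0/24"): 64513,
}


def asn_token(ip):
    """Return a token such as 'AS64512' for a known IP, else None."""
    addr = ipaddress.ip_address(ip)
    for network, asn in PREFIX_TO_ASN.items():
        if addr in network:
            return f"AS{asn}"
    return None


if __name__ == "__main__":
    # Two different IPs in the same Autonomous System yield the same token,
    # so the filter learns about the whole network rather than single hosts.
    print(asn_token("192.0.2.10"), asn_token("192.0.2.200"))
```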

Tom



