Bogofilter seems to be maturing nicely....thanks to everyone who worked on making this a mature tool.

Nick Simicich njs at scifi.squawk.com
Sun Dec 8 09:45:55 CET 2002


I read an article in RISKS about bogofilter some time ago and determined to 
try it. My goal was to extend the life of this poor P-90 running a heavily 
upgraded Red Hat 5, which handles just a couple of little things for 
me.  I was no longer able to expand my spam methodology, which was to run 
hand-constructed regexps against mail to put it in a spambox -- the 
spamboxing is driven by a long and very personal maildrop program.  I 
glance through the spambox every couple of days.  If something is not spam, 
I relay it to the machine with the UI that I actually read my mail on, and 
if it is spam, I send it on to spamcop where I submit reports.  This is 
done with a short perl script and it is much faster than trying to drive 
spamcop in any other manner.  It also allows me to deal with a large amount 
of spam in a very short time without actually reading it - I have the perl 
program extract "indicators".  If I find that I am making a decision based 
on the same indicator more than once a day, it gets added to the regexp 
list so that the decision is made automatically.

The first attempt to get bogofilter running was a complete disaster, and I 
think I may have even said here that I thought that the overall readiness 
of the tool was overstated in the RISKS article.  Others said the same 
thing, and I am glad that they had the time to work on the tool and improve 
it. A lot.

This second attempt was not nearly as bad. The only dependency was the new 
db packages. The first time around, porting the original package that was 
used to deal with named arrays was very painful.  This time, compiling the 
latest version of Berkeley db (the only thing I had to do) was painless.

Bogofilter seems to be running pretty well.  I took a couple hours to train 
it on the last day or so of mail that I had gotten.  I added a

         xfilter "$BOGOFILTER -p -e -u -l"

early in the process, which tags the mail, and also puts the words into 
either the "spam" or "nonspam" wordlist as it classifies the mail.

Anything that bogofilter classifies as spam is checked against whitelist 
expressions.  If it is not in the whitelist, it is treated as spam.  If it 
is not pre-classified as spam, it is then checked against my set of 
expressions.
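
As a rough illustration of that ordering, here is a minimal Python sketch. It is not my actual maildrop program: the patterns, the `route` function and the mailbox names are hypothetical, and it assumes that a whitelisted message goes straight to the inbox without being run through the expressions.

```python
import re

# Hypothetical patterns, for illustration only; the real lists are not shown here.
WHITELIST = [re.compile(r"^From:.*@example\.org", re.I | re.M)]
SPAM_REGEXPS = [re.compile(r"^Subject:.*free money", re.I | re.M)]


def route(message: str, bogofilter_says_spam: bool) -> str:
    """Return 'spambox' or 'inbox', following the order described above."""
    if bogofilter_says_spam:
        # The whitelist is consulted only for mail bogofilter has flagged.
        if any(p.search(message) for p in WHITELIST):
            return "inbox"  # assumption: whitelisted mail skips the regexps
        return "spambox"
    # Expensive step: reached only when bogofilter did not flag the message.
    if any(p.search(message) for p in SPAM_REGEXPS):
        return "spambox"
    return "inbox"
```

The cheap bogofilter verdict comes first, so the regexp pass only runs on mail that bogofilter let through.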

Running the message against the expressions is very expensive, but fewer 
and fewer have to be, because more and more are being caught, as bogofilter 
seems to be learning to classify messages as my expressions would.

In my environment, that old version of bogofilter was pretty 
hopeless.  This version was about as painless as installing a tool that 
gets into the guts of mail delivery can get.

I have some minor nuances regarding misclassification left to iron out.  If 
a message ends up being passed into my personal in box, and I want to say, 
"no this is really spam", I have a queue of messages on disk.  I grep 
through that queue, and I move the message into the "might be spam" queue 
which is where a message that is "treated as spam" is placed.  Later, I 
have a perl script that allows me to quickly look through those messages 
and decide "spam"; "spam with fraud, send to FTC"; "spam with stock fraud, 
FTC and SEC"; etc.; "report to spamcop"; or even "misclassified as spam, 
requeue to main mailbox and whitelist". The point is that I need to figure 
out in that process if the message has already been classified as spam, and 
if it has, whether it needs to be -N'd if I am queueing it back to the mail 
server with a whitelist flag, or if it needs to be -S'd because it was 
manually moved to backup from reporting.  I think I have that correct now.
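
The decision in that last step can be sketched roughly as follows. This is a Python illustration, not the perl script itself. It assumes that bogofilter's -p tagging leaves an "X-Bogosity:" header whose value starts with "Yes" on mail it classified as spam, and the function name `correction_flag` is hypothetical. The -N and -S letters are used as described above; their meaning may differ between bogofilter releases, so check the man page for the installed version.

```python
from email import message_from_string


def correction_flag(raw_message: str, really_spam: bool):
    """Return '-N', '-S', or None if the wordlists are already right."""
    msg = message_from_string(raw_message)
    # Assumption: spam-tagged mail carries a header like "X-Bogosity: Yes, ...".
    header = msg.get("X-Bogosity", "")
    tagged_spam = header.strip().lower().startswith("yes")
    if tagged_spam and not really_spam:
        return "-N"  # counted as spam, being requeued as good mail
    if not tagged_spam and really_spam:
        return "-S"  # counted as nonspam, manually moved to the spam queue
    return None  # classification already matches my decision
```

The returned flag would then be passed to bogofilter with the message on standard input; when the result is None, nothing needs to be re-registered.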

These are relative nits.  The point is that the databasing seems to be 
working, and the tool is working about as well as can be expected, and the 
overall load on the server is lower, and it is pushing mail through 
noticeably faster because much of the spam is bypassing the regexp checking. 
I am well pleased, despite the bug reports, and I want to say thanks for 
the massive improvement.  This tool has turned completely around, and I am 
considering using a version with a different wordlist to try and sort 
through stuff from majordomo that I want to look at vs. stuff I do not want 
to look at.

--
If you doubt that magnet therapy works, I put to you this observation: When 
refrigerators were first invented, in the 1940s, they were rather 
unreliable, but then they became significantly more reliable. The basic 
design of the refrigerator did not change, and we all know that quality was 
important back then, so I doubt that newer refrigerators are made better. 
Refrigerators have become more reliable because of the rise of the 
refrigerator magnet.
Nick Simicich - njs at scifi.squawk.com


