Bogofilter seems to be maturing nicely....thanks to everyone who worked on making this a mature tool.
Nick Simicich
njs at scifi.squawk.com
Sun Dec 8 09:45:55 CET 2002
Some time ago I read an article in RISKS about bogofilter and decided to
try it. My goal was to extend the life of this poor P-90 running Red Hat
5 with numerous upgrades, which handles just a couple of little things for
me. I was no longer able to expand my spam methodology, which was to run
hand-constructed regexps against mail to put it in a spambox -- the
spamboxing is driven by a long and very personal maildrop program. I
glance through the spambox every couple of days. If something is not spam,
I relay it to the machine with the UI that I actually read my mail on, and
if it is spam, I send it on to spamcop where I submit reports. This is
done with a short perl script and it is much faster than trying to drive
spamcop in any other manner. It also allows me to deal with a large amount
of spam in a very short time without actually reading it - I have the perl
program extract "indicators". If I find that I am making a decision based
on the same indicator more than once a day, it gets added to the regexp
list so that the decision is made automatically.
The first attempt to get bogofilter running was a complete disaster, and I
think I may even have said here that the RISKS article overstated the
overall readiness of the tool. Others said the same thing, and I am glad
that they had the time to work on the tool and improve it. A lot.
This second attempt was not nearly as bad. The only dependency was the new
db packages. In the first attempt, porting the original package that was
used to deal with named arrays was very painful. This time, compiling the
latest version of Berkeley DB (the only thing I had to do) was painless.
Bogofilter seems to be running pretty well. I took a couple hours to train
it on the last day or so of mail that I had gotten. I added a
xfilter "$BOGOFILTER -p -e -u -l"
early in the process, which tags the mail, and also puts the words into
either "spam" or "nonspam" as it classifies the mail.
Anything that bogofilter classifies as spam is checked against whitelist
expressions. If it is not in the whitelist, it is treated as spam. If it
is not pre-classified as spam, it is then checked against my set of
expressions.
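
The order of checks described above can be sketched roughly as follows. This is only an illustration in Python, not the actual maildrop program: the function name `route`, the two pattern lists, and the `bogo_says_spam` argument (standing in for however the bogofilter tag is read) are hypothetical. The post does not say what happens to a whitelisted message, so the sketch assumes it goes straight to the inbox.

```python
import re
from typing import Sequence


def route(message: str, bogo_says_spam: bool,
          whitelist: Sequence[str], spam_patterns: Sequence[str]) -> str:
    """Return "spambox" or "inbox" for one message (hypothetical helper)."""
    if bogo_says_spam:
        # Bogofilter's verdict stands unless a whitelist expression matches.
        # Assumption: a whitelisted message is delivered without further checks.
        if any(re.search(p, message, re.MULTILINE) for p in whitelist):
            return "inbox"
        return "spambox"
    # Not pre-classified as spam: fall back to the expensive hand-built regexps.
    if any(re.search(p, message, re.MULTILINE) for p in spam_patterns):
        return "spambox"
    return "inbox"
```

The cheap bogofilter verdict comes first, so the regexp list only runs on mail that bogofilter did not already flag.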
Running a message against the expressions is very expensive, but fewer and
fewer messages need it: more and more are caught first by bogofilter, which
seems to be learning to classify messages as my expressions would.
In my environment, that old version of bogofilter was pretty
hopeless. This version was about as painless as installing a tool that
gets into the guts of mail delivery can get.
I have some minor nuances regarding misclassification left to iron out. If
a message ends up being passed into my personal in box, and I want to say,
"no this is really spam", I have a queue of messages on disk. I grep
through that queue, and I move the message into the "might be spam" queue
which is where a message that is "treated as spam" is placed. Later, I
have a perl script that allows me to quickly look through those messages
and decide: "spam"; "spam with fraud, send to FTC"; "spam with stock
fraud, FTC and SEC"; etc.; "report to spamcop"; or even "misclassified as
spam, requeue to main mailbox and whitelist". The point is that I need to
figure out in that process if the message has already been classified as
spam, and if it has, whether it needs to be -N'd if I am queueing it back
to the mail server with a whitelist flag, or if it needs to be -S'd
because it was manually moved to backup from reporting. I think I have
that correct now.
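
The bookkeeping for corrections can be sketched as a small decision function. This is a minimal Python illustration under stated assumptions, not the perl script itself: the name `correction_flag` and its two arguments are hypothetical, and -S and -N are taken in the sense used above (move a message's words from the nonspam list to the spam list, and the reverse). The meaning of these flags may differ between bogofilter versions, so check the man page of the installed release.

```python
from typing import Optional


def correction_flag(registered_as_spam: bool,
                    human_says_spam: bool) -> Optional[str]:
    """Pick the bogofilter correction flag for one reviewed message.

    registered_as_spam: how the -u run filed the message at delivery time.
    human_says_spam:    the decision made during manual review.
    Returns None when the automatic registration was already right.
    """
    if registered_as_spam == human_says_spam:
        return None  # wordlists are already correct, nothing to undo
    # Assumed semantics: -S moves nonspam -> spam, -N moves spam -> nonspam.
    return "-S" if human_says_spam else "-N"
```

Because -u registers every message one way or the other at delivery, a correction is only needed when the review disagrees with that registration.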
These are relative nits. The point is that the databasing seems to be
working, and the tool is working about as well as can be expected, and the
overall load on the server is lower, and it is pushing mail through
noticeably faster because much of the spam is bypassing the regexp checking.
I am well pleased, despite the bug reports, and I want to say thanks for
the massive improvement. This tool has turned completely around, and I am
considering using a version with a different wordlist to try and sort
through stuff from majordomo that I want to look at vs. stuff I do not want
to look at.
--
If you doubt that magnet therapy works, I put to you this observation: When
refrigerators were first invented, in the 1940s, they were rather
unreliable, but then they became significantly more reliable. The basic
design of the refrigerator did not change, and we all know that quality was
important back then, so I doubt that newer refrigerators are made better.
Refrigerators have become more reliable because of the rise of the
refrigerator magnet.
Nick Simicich - njs at scifi.squawk.com