Bogofilter accuracy plummets starting around March 10, 2010
Jonathan Kamens
jik at kamens.brookline.ma.us
Thu Apr 1 07:35:57 CEST 2010
Hi all,
I assume I'm not the only one who has noticed that bogofilter's accuracy
has plummeted starting around March 10?
In the past 60 days, I've averaged 816 spam messages per day, with a
peak of 1,254 in a single day. In that same period, my bogofilter block
rate was 98.4% before March 10, but only 84.8% between March 10 and
March 31. Yowza!
Check out the graphs illustrating this on my home page
<http://stuff.mit.edu/%7Ejik/#spam>.
I tried retraining bogofilter from my large, accurate corpus of recent
spam and ham, and it didn't help.
The culprit appears to be almost entirely messages such as this one
<http://jik3.kamens.brookline.ma.us/%7Ejik/sample-spam.eml> (on the Web
instead of embedded here so that I don't mess up people's filters), each
of which contains, below the actual spam payload, a sequence of random
text snippets on many different topics.
These messages are coming from many different IP addresses, so it would
seem that they're being generated by a botnet.
I did a quick statistical analysis of a small subset of these messages
that I've received, 35 of them, and discovered that these 35 messages
contained 10,860 unique words, of which over 68% appeared in only one of
the messages, 81% appeared in one or two messages, 87% appeared in 1-3
messages, 90% appeared in 1-4 messages, and 98% appeared in fewer than
half of the messages. This would seem to indicate that the text
snippets being used by the spam generator vary widely and are thus
likely to hit upon keywords that previously occurred in ham.
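
For anyone who wants to run the same kind of count on their own mail,
here is a minimal sketch in Python. It is not the script I used. The
function names (document_frequencies, cumulative_share) are made up for
illustration, and the regular-expression tokenizer is a simplifying
assumption that does not match bogofilter's own lexer, so the exact
percentages will differ from what bogofilter sees.

```python
import re
from collections import Counter


def document_frequencies(messages):
    """Map each unique word to the number of messages that contain it.

    `messages` is an iterable of message bodies as plain strings.
    The tokenizer below is an assumption: lowercase runs of letters
    and apostrophes. Bogofilter tokenizes differently.
    """
    df = Counter()
    for text in messages:
        # set() so a word counts once per message, however often it repeats
        df.update(set(re.findall(r"[a-z']+", text.lower())))
    return df


def cumulative_share(df, max_messages):
    """Fraction of unique words that appear in at most `max_messages` messages."""
    if not df:
        return 0.0
    return sum(1 for n in df.values() if n <= max_messages) / len(df)


if __name__ == "__main__":
    # Toy input for illustration only, not real spam.
    sample = [
        "buy pills now",
        "buy watches now",
        "random snippet text",
    ]
    df = document_frequencies(sample)
    print("unique words:", len(df))
    for k in (1, 2):
        print(f"in at most {k} message(s): {cumulative_share(df, k):.0%}")
```

A high share of words that show up in only one or two messages is what
you would expect if the filler text is drawn from a large, varied pool
rather than from a fixed template.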
It would seem that somebody has figured out how to do a pretty good job
of outsmarting Bayesian filters.
What can we do about it?
Jonathan Kamens