Bogofilter accuracy plummets starting around March 10, 2010

Jonathan Kamens jik at kamens.brookline.ma.us
Thu Apr 1 05:35:57 UTC 2010


Hi all,

I assume I'm not the only one who has noticed that bogofilter's accuracy 
has plummeted since around March 10?

In the past 60 days, I've averaged 816 spam messages per day, with a 
peak of 1,254 in a single day.  In that same period, my bogofilter block 
rate was 98.4% before March 10, but only 84.8% between March 10 and 
March 31.  Yowza!

Check out the graphs illustrating this on my home page 
<http://stuff.mit.edu/%7Ejik/#spam>.

I tried retraining bogofilter from my large, accurate corpus of recent 
spam and ham, and it didn't help.

The culprit appears to be almost entirely messages such as this one 
<http://jik3.kamens.brookline.ma.us/%7Ejik/sample-spam.eml> (on the Web 
instead of embedded here so that I don't mess up people's filters), each 
of which contains, below the actual spam payload, a sequence of random 
text snippets on many different topics.

These messages are coming from many different IP addresses, so it would 
seem that they're being generated by a botnet.

I did a quick statistical analysis of a small subset of these messages 
that I've received, 35 of them, and discovered that these 35 messages 
contained 10,860 unique words, of which over 68% appeared in only one of 
the messages, 81% appeared in one or two messages, 87% appeared in 1-3 
messages, 90% appeared in 1-4 messages, and 98% appeared in fewer than 
half of the messages.  This would seem to indicate that the text 
snippets being used by the spam generator vary widely and are thus 
likely to hit upon keywords that previously occurred in ham.
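
For anyone who wants to run a similar tally on their own samples, here is 
a minimal sketch of the kind of count described above.  It is not the 
script I used; the tokenizer (a simple regular expression), the function 
names, and the three sample strings are all assumptions made up for 
illustration.

```python
import re
from collections import Counter


def document_frequencies(messages):
    """Count, for each unique word, how many messages contain it."""
    df = Counter()
    for text in messages:
        # Use a set so a word counts at most once per message.
        # The regex tokenizer is an assumption, not bogofilter's lexer.
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        df.update(words)
    return df


def cumulative_share(df, max_messages):
    """Fraction of unique words that appear in at most max_messages messages."""
    if not df:
        return 0.0
    return sum(1 for n in df.values() if n <= max_messages) / len(df)


# Hypothetical stand-ins for message bodies: a shared payload plus
# padding text that differs from message to message.
sample = [
    "cheap pills gardening tips for spring tomatoes",
    "cheap pills quarterly budget meeting notes",
    "cheap pills sailing weather forecast notes",
]

df = document_frequencies(sample)
for k in (1, 2, 3):
    print(f"in at most {k} message(s): {cumulative_share(df, k):.1%}")
```

A high share at k=1 means most of the vocabulary is unique to a single 
message, which is the pattern I saw in the 35 messages.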

It would seem that somebody has figured out how to do a pretty good job 
of outsmarting Bayesian filters.
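
To illustrate why this kind of padding can work, here is a rough sketch 
of a naive Bayes style combination of per-token spam probabilities.  This 
is NOT bogofilter's actual computation; it is a simplified log-odds sum, 
and every token probability below is invented for the example.

```python
import math


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities by summing their log-odds.

    Simplified naive Bayes combination, assumed here for illustration;
    bogofilter's real scoring differs.
    """
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in token_probs)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical probabilities: three strongly spammy payload tokens...
payload = [0.99, 0.95, 0.90]
# ...and ten padding tokens that previously showed up mostly in ham.
padding = [0.20] * 10

print("payload only:    ", combined_spam_probability(payload))
print("payload + padding:", combined_spam_probability(payload + padding))
```

With these made-up numbers, the payload alone scores as near-certain 
spam, while the same payload plus enough ham-leaning padding drops below 
0.5.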

What can we do about it?

   Jonathan Kamens


