Bogofilter accuracy plummets starting around March 10, 2010

Thu Apr 1 09:04:22 CEST 2010

I've seen some of these spams, but they haven't been a major nuisance.
Most are classified as spam, and a few as unsure.  It's important to
keep your robx between your ham and spam cutoffs, and preferably within
min_dev of 0.5, so that new tokens are scored neutrally and don't
immediately affect classifications.  I also run spamitarium on my
emails before bogofilter, which often tags the header with an SPF
failure or missing rDNS, and that helps push a message into the spam
classification.  And with training, I see them in my ham and unsure
boxes less and less.
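
To make the robx/min_dev point concrete, here is a minimal sketch of
Robinson-style token smoothing, f(w) = (s*x + n*p(w)) / (s + n), which
is the general scheme bogofilter's documentation describes.  The
function names and the numeric values (robx=0.52, robs=0.0178,
min_dev=0.375) are illustrative assumptions, not a copy of bogofilter's
code or a recommendation for your config.

```python
def token_score(spam_count, ham_count, spam_msgs, ham_msgs,
                robx=0.52, robs=0.0178):
    """Smoothed spamicity of one token (Robinson-style sketch).

    robx is the score assumed for a token never seen before;
    robs is the weight given to that assumption.
    All names and default values here are illustrative assumptions.
    """
    # Relative frequency of the token in each corpus.
    spam_freq = spam_count / spam_msgs if spam_msgs else 0.0
    ham_freq = ham_count / ham_msgs if ham_msgs else 0.0
    n = spam_count + ham_count
    if n == 0 or spam_freq + ham_freq == 0:
        # No usable evidence: the token gets exactly robx.
        return robx
    p = spam_freq / (spam_freq + ham_freq)
    # Blend the prior (robx) with the observed ratio, weighted by evidence.
    return (robs * robx + n * p) / (robs + n)


def is_used(score, min_dev=0.375):
    """A token counts toward the message score only if it is far enough from 0.5."""
    return abs(score - 0.5) >= min_dev


if __name__ == "__main__":
    unseen = token_score(0, 0, 1000, 1000)
    spammy = token_score(10, 0, 1000, 1000)
    print(unseen, is_used(unseen))
    print(spammy, is_used(spammy))
```

Under these assumptions, a never-seen token scores exactly robx, lands
inside the min_dev band around 0.5, and is left out of the message
score until training gives it some history.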

Tom

On 4/1/2010 1:35 AM, Jonathan Kamens wrote:
> Hi all,
>
> I assume I'm not the only one who has noticed that bogofilter's accuracy
> has plummeted starting around March 10?
>
> In the past 60 days, I've averaged 816 spam messages per day, with a
> peak of 1,254 in a single day.  In that same period, my bogofilter block
> rate was 98.4% before March 10, but only 84.8% between March 10 and
> March 31.  Yowza!
>
> Check out the graphs illustrating this on my home page
> <http://stuff.mit.edu/%7Ejik/#spam>.
>
> I tried retraining bogofilter from my large, accurate corpus of recent
> spam and ham, and it didn't help.
>
> The culprit appears to be almost entirely messages such as this one
> <http://jik3.kamens.brookline.ma.us/%7Ejik/sample-spam.eml>  (on the Web
> instead of embedded here so that I don't mess up people's filters), each
> of which contains, below the actual spam payload, a sequence of random
> text snippets on many different topics.
>
> These messages are coming from many different IP addresses, so it would
> seem that they're being generated by a botnet.
>
> I did a quick statistical analysis of a small subset of these messages
> that I've received, 35 of them, and discovered that these 35 messages
> contained 10,860 unique words, of which over 68% appeared in only one of
> the messages, 81% appeared in one or two messages, 87% appeared in 1-3
> messages, 90% appeared in 1-4 messages, and 98% appeared in fewer than
> half of the messages.  This would seem to indicate that the text
> snippets being used by the spam generator vary widely and are thus
> likely to hit upon keywords that previously occurred in ham.
>
> It would seem that somebody has figured out how to do a pretty good job
> of outsmarting Bayesian filters.
>
> What can we do about it?
>
>     Jonathan Kamens
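
The word-frequency analysis Jonathan describes can be reproduced
roughly as follows.  This is a sketch under stated assumptions: the
original post does not say how words were tokenized, so the regular
expression and the function names below are hypothetical choices.

```python
import re
from collections import Counter


def document_frequencies(messages):
    """Map each unique word to the number of messages that contain it."""
    df = Counter()
    for text in messages:
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        words = set(re.findall(r"[a-z']+", text.lower()))
        df.update(words)  # each message counts a word at most once
    return df


def share_at_most(df, k):
    """Fraction of unique words that appear in at most k messages."""
    if not df:
        return 0.0
    return sum(1 for count in df.values() if count <= k) / len(df)


if __name__ == "__main__":
    sample = ["buy pills now", "buy cheap watches", "buy now"]
    df = document_frequencies(sample)
    for k in (1, 2, 3):
        print(k, share_at_most(df, k))
```

Run over a folder of saved messages, share_at_most(df, 1) corresponds
to the "appeared in only one of the messages" figure, and larger k
values give the cumulative percentages.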