Bogofilter accuracy plummets starting around March 10, 2010
Thomas Anderson
tanderson at orderamidchaos.com
Thu Apr 1 09:04:22 CEST 2010
I've seen some of these spams, but they've not been a major nuisance.
Most of them are classified as spam, and some as unsure. It's important
to have your robx between your ham and spam cutoffs, and preferably
within min_dev of 0.5, so that new tokens are graded neutrally and don't
immediately affect classifications. I also run spamitarium on my emails
prior to bogofilter, and this often tags the header with an SPF failure
or missing rDNS, which helps bump the message into a spam classification.
And with training, I see them in my ham and unsure boxes less and less.
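
To make the robx/min_dev point concrete, here is a rough Python sketch of
Robinson-style token scoring. It is not bogofilter's actual code: the
function names are made up for illustration, and the numeric values
(robx, robs, min_dev, cutoffs) are only example settings, so check them
against your own bogofilter.cf.

```python
# Illustrative values only (assumptions); compare with your own configuration.
ROBX = 0.52         # score assigned to a token never seen in training
ROBS = 0.0178       # weight given to ROBX relative to observed counts
MIN_DEV = 0.375     # tokens scoring within this distance of 0.5 are ignored
HAM_CUTOFF = 0.45
SPAM_CUTOFF = 0.99


def token_score(spam_count, ham_count, spam_msgs, ham_msgs, robx=ROBX, robs=ROBS):
    """Robinson-smoothed score f(w) = (s*x + n*p) / (s + n).

    Assumes spam_msgs and ham_msgs are both greater than zero.
    """
    n = spam_count + ham_count
    if n == 0:
        return robx  # an unseen token scores exactly robx
    spam_rate = spam_count / spam_msgs
    ham_rate = ham_count / ham_msgs
    p = spam_rate / (spam_rate + ham_rate)
    return (robs * robx + n * p) / (robs + n)


def is_counted(score, min_dev=MIN_DEV):
    """True if the token is far enough from 0.5 to influence the result."""
    return abs(score - 0.5) >= min_dev


# A brand-new token from the random text scores robx. With robx inside the
# min_dev band, it is ignored instead of nudging the message either way.
new_token = token_score(0, 0, spam_msgs=1000, ham_msgs=1000)
print(new_token, is_counted(new_token))
print(HAM_CUTOFF < ROBX < SPAM_CUTOFF)
```

With settings like these, never-before-seen words in the random snippets
drop out of the calculation, and only tokens with real training history
contribute to the score.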
Tom
On 4/1/2010 1:35 AM, Jonathan Kamens wrote:
> Hi all,
>
> I assume I'm not the only one who has noticed that bogofilter's accuracy
> has plummeted starting around March 10?
>
> In the past 60 days, I've averaged 816 spam messages per day, with a
> peak of 1,254 in a single day. In that same period, my bogofilter block
> rate was 98.4% before March 10, but only 84.8% between March 10 and
> March 31. Yowza!
>
> Check out the graphs illustrating this on my home page
> <http://stuff.mit.edu/%7Ejik/#spam>.
>
> I tried retraining bogofilter from my large, accurate corpus of recent
> spam and ham, and it didn't help.
>
> The culprit appears to be almost entirely messages such as this one
> <http://jik3.kamens.brookline.ma.us/%7Ejik/sample-spam.eml> (on the Web
> instead of embedded here so that I don't mess up people's filters), each
> of which contains, below the actual spam payload, a sequence of random
> text snippets on many different topics.
>
> These messages are coming from many different IP addresses, so it would
> seem that they're being generated by a botnet.
>
> I did a quick statistical analysis of a small subset of these messages
> that I've received, 35 of them, and discovered that these 35 messages
> contained 10,860 unique words, of which over 68% appeared in only one of
> the messages, 81% appeared in one or two messages, 87% appeared in 1-3
> messages, 90% appeared in 1-4 messages, and 98% appeared in fewer than
> half of the messages. This would seem to indicate that the text
> snippets being used by the spam generator vary widely and are thus
> likely to hit upon keywords that previously occurred in ham.
>
> It would seem that somebody has figured out how to do a pretty good job
> of outsmarting Bayesian filters.
>
> What can we do about it?
>
> Jonathan Kamens
>
> _______________________________________________
> Bogofilter mailing list
> Bogofilter at bogofilter.org
> http://www.bogofilter.org/mailman/listinfo/bogofilter
>
>
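
For anyone who wants to repeat the kind of word-frequency analysis
described in the quoted message on their own sample, here is a minimal
Python sketch. The original post does not say how the words were
extracted, so the tokenizer (lowercased runs of letters and apostrophes)
and both function names are assumptions made for illustration.

```python
import re
from collections import Counter


def document_frequencies(messages):
    """Map each unique word to the number of messages it appears in."""
    df = Counter()
    for text in messages:
        # set() ensures a word is counted at most once per message
        df.update(set(re.findall(r"[a-z']+", text.lower())))
    return df


def share_in_at_most(df, k):
    """Fraction of unique words that appear in at most k messages."""
    if not df:
        return 0.0
    return sum(1 for n in df.values() if n <= k) / len(df)


# Tiny made-up sample; replace with the bodies of the messages to analyze.
sample = ["buy pills now", "buy watches now", "cheap pills today"]
df = document_frequencies(sample)
print(len(df), "unique words")
for k in (1, 2):
    print(f"{share_in_at_most(df, k):.0%} appear in at most {k} message(s)")
```

A high share of words at k=1 or k=2 means the filler text rarely repeats
between messages, which is the pattern the quoted figures describe.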