Bogofilter accuracy plummets starting around March 10, 2010

Jonathan Kamens jik at kamens.brookline.ma.us
Wed Apr 7 19:50:07 CEST 2010


It looks like I've managed to get the spam outbreak under control in two
ways, as suggested by the announcement I just sent out about the new version
of bogofilter-milter.pl:

 

1. I've configured bogofilter-milter.pl to use Subject line matching to
ignore messages, e.g., spam summaries, that aren't spam but have lots of
spam keywords in them.  This helps the bogofilter word list stay more
"pure" and accurate.

 

2. I am now pre-processing email with spamitarium.pl, as suggested by Thomas
Anderson, before feeding it to bogofilter.  I addressed my concern about
losing header information by modifying only the copy of the message that
gets fed into bogofilter, not the message that ends up in my mailbox, and by
not filtering out non-standard header fields.
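
The two steps above could be sketched roughly as follows.  This is an
illustration in Python, not the actual code in bogofilter-milter.pl or
spamitarium.pl.  The Subject pattern and the handle() function are
hypothetical, and preprocess and classify are caller-supplied stand-ins for
spamitarium and bogofilter respectively.

```python
import re
from email import message_from_string
from typing import Callable, Optional, Tuple

# Hypothetical pattern for reports that quote spam but are not spam themselves.
SKIP_SUBJECT = re.compile(r"spam (summary|report|digest)", re.IGNORECASE)


def should_skip(raw: str) -> bool:
    """Return True if the Subject line marks a message the filter should not see."""
    subject = str(message_from_string(raw).get("Subject", ""))
    return bool(SKIP_SUBJECT.search(subject))


def handle(raw: str,
           preprocess: Callable[[str], str],
           classify: Callable[[str], str]) -> Tuple[str, Optional[str]]:
    """Return (message to deliver, verdict); the verdict is None when skipped."""
    if should_skip(raw):
        return raw, None
    # Only the copy handed to the classifier is modified.
    verdict = classify(preprocess(raw))
    # The delivered message is always the untouched original.
    return raw, verdict
```

The point of returning raw unchanged is that whatever the pre-processor
strips or rewrites never affects the headers that reach the mailbox.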

 

I also fed my entire old ham and spam corpuses through spamitarium and then
wiped out my word list and recreated it from scratch using the new corpuses.
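
A rebuild along these lines might look like the sketch below.  It assumes
one message per file and a fresh, empty word list directory.  The retrain()
helper and its preprocess argument (a stand-in for spamitarium) are
hypothetical.  The bogofilter options used are -s (register as spam), -n
(register as non-spam) and -d (word list directory), with the message read
from standard input.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def train_command(is_spam: bool, wordlist_dir: str) -> List[str]:
    """Build the bogofilter command that registers stdin as spam (-s) or ham (-n)."""
    return ["bogofilter", "-s" if is_spam else "-n", "-d", wordlist_dir]


def retrain(messages: Iterable[Path], is_spam: bool, wordlist_dir: str,
            preprocess: Callable[[bytes], bytes], run=subprocess.run) -> int:
    """Feed each pre-processed message to bogofilter; return how many were registered."""
    count = 0
    for path in messages:
        cleaned = preprocess(path.read_bytes())  # same cleaning as at delivery time
        run(train_command(is_spam, wordlist_dir), input=cleaned, check=True)
        count += 1
    return count
```

Running the same pre-processing over the training corpus as over incoming
mail keeps the tokens in the word list consistent with what bogofilter sees
at classification time.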

 

In the day since I made these changes, I've gone from 79% accuracy to >99%
accuracy <http://stuff.mit.edu/~jik/#spam>, so either all the spam of the
type that was confusing bogofilter has suddenly stopped (unlikely!), or
these changes were quite successful.

 

Thanks to everyone here for your advice!

 

  jik



