When "meta" messages introduce noise into the bogofilter database

Jonathan Kamens jik at kamens.brookline.ma.us
Thu Apr 1 16:31:20 CEST 2010


I have identified something that may help explain why my bogofilter 
is having so much trouble filtering the recent flood of 
Bayesian-avoidance spam, and I wanted to run it past you all to (a) see 
if you agree that it could be a significant factor and (b) ask for any 
suggestions for how to deal with it.

I administer the STUMP moderation 'bot for several Usenet 
newsgroups.  Incoming submissions for one of them are filtered with 
bogofilter, and in addition, several technical rules are employed to 
block impermissible submissions (e.g., multipart or HTML, both of which 
are prohibited by the charter of the newsgroup).  Most of the 
submissions that are blocked by rules turn out to be spam, which means 
that I want to train bogofilter with them, but by the time the technical 
rules come into play, bogofilter has already passed the message.  
Therefore, when a message is blocked by a rule, I get a notification 
about it in my email, which gives me the opportunity to review the 
message specifics, determine whether it was in fact spam, and if so, 
retrain bogofilter to let it know.

The problem is that the notification I get in my email contains the 
Message-ID, From and Subject fields from the message in question.  Since 
the notification itself gets passed by bogofilter, all of the keywords 
in those fields end up counted as ham keywords, which makes bogofilter 
worse at recognizing those same words as spam keywords in actual spam 
messages coming into my inbox.
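
To make the effect concrete, here is a toy sketch in Python. It is not 
bogofilter's actual algorithm, which combines per-token statistics in a 
more sophisticated way and may treat header fields specially.  The 
ToyFilter class, the simple spam/(spam+ham) ratio, and the sample 
subject line are all assumptions made up for illustration; the sketch 
only shows the direction in which the counts move when a notification 
quoting a spam subject is counted as ham.

```python
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real tokenizers differ)."""
    return text.lower().split()


class ToyFilter:
    """Hypothetical per-token counter, loosely in the spirit of a Bayesian filter."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        # Count each distinct token once per message.
        target = self.spam if is_spam else self.ham
        target.update(set(tokenize(text)))

    def spamicity(self, token):
        # Toy estimate: fraction of messages containing the token that were spam.
        s, h = self.spam[token], self.ham[token]
        return s / (s + h) if (s + h) else 0.5


f = ToyFilter()
spam_subject = "cheap replica watches"  # made-up example subject

# The blocked submission is eventually trained as spam.
f.train(spam_subject, is_spam=True)
before = f.spamicity("replica")

# The moderation notification quotes the same subject and is counted as ham.
f.train("Subject: " + spam_subject, is_spam=False)
after = f.spamicity("replica")

print(before, after)
```

In this toy model, a single notification counted as ham is enough to 
pull a token that had only ever appeared in spam back to a neutral 
score.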

Have others encountered a problem like this?  Any suggestions for how to 
avoid it?

Thanks,

Jonathan Kamens



