When "meta" messages introduce noise into the bogofilter database
Jonathan Kamens
jik at kamens.brookline.ma.us
Thu Apr 1 16:31:20 CEST 2010
I have identified a factor that may be contributing to my bogofilter's
trouble filtering the recent flood of Bayesian-avoidance spam, and I
wanted to run it past you all to (a) see if you agree that it could be
significant and (b) ask for any suggestions for how to deal with it.
I administer the STUMP moderation 'bot for several Usenet
newsgroups. Incoming submissions for one of them are filtered with
bogofilter, and in addition, several technical rules are employed to
block impermissible submissions (e.g., multipart or HTML, both of which
are prohibited by the charter of the newsgroup). Most of the
submissions that are blocked by rules turn out to be spam, which means
that I want to train bogofilter with them, but by the time the technical
rules come into play, bogofilter has already passed the message.
Therefore, when a message is blocked by a rule, I get a notification
about it in my email, which gives me the opportunity to review the
message, determine whether it was in fact spam, and if so, retrain
bogofilter accordingly.
The problem is that the notification I get in my email contains the
Message-ID, From and Subject fields from the message in question. Since
bogofilter passes the notification as ham, all of the keywords in those
fields end up being counted as ham keywords, which means that
bogofilter gets worse at recognizing those same words as spam keywords
in actual spam messages coming into my inbox.
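
To make the effect concrete, here is a toy sketch in Python. It is not
bogofilter's actual algorithm: the whitespace tokenizer, the naive
spam/(spam+ham) score, the sample Subject line and the notification
wording are all assumptions made up for illustration. It only shows how
ham-classified notifications that quote a spam Subject pull that
Subject's words toward ham.

```python
from collections import Counter

def tokens(text):
    # Crude whitespace tokenizer; bogofilter's real lexer is more elaborate.
    return text.lower().split()

def spamicity(token, spam, ham):
    # Naive per-token estimate: fraction of sightings that were in spam.
    # (Illustrative only; not bogofilter's real scoring formula.)
    s, h = spam[token], ham[token]
    return s / (s + h) if s + h else 0.5

spam, ham = Counter(), Counter()

# A hypothetical spam message is trained as spam once.
spam_subject = "cheap pills online"
spam.update(tokens(spam_subject))
before = spamicity("pills", spam, ham)

# Three rule-block notifications quoting that Subject are counted as ham.
for _ in range(3):
    ham.update(tokens("blocked submission subject: " + spam_subject))
after = spamicity("pills", spam, ham)

print(before, after)
```

In this toy model the score for "pills" drops from 1.0 to 0.25 after
three notifications, which is the kind of dilution I suspect is
happening in my real database.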
Have others encountered a problem like this? Any suggestions for how to
avoid it?
Thanks,
Jonathan Kamens