How do I filter out spam that turns up on mailing lists?

David Relson relson at osagesoftware.com
Tue Jan 8 00:14:31 CET 2008


On Mon, 7 Jan 2008 21:35:35 +0100
Nigel Henry wrote:

> On Monday 07 January 2008 21:08, Tom Anderson wrote:
> > Nigel Henry wrote:
> > > Cutting to the chase. There has just been another batch of spam
> > > getting through Debian mail filters, and has turned up in my
> > > Debian mailbox, so it appears that bogofilter was not able to
> > > detect the Debian list spam when it processed all the incoming
> > > mail.
> > >
> > > Any suggestions on how to deal with mailing list spam?
> >
> > Yep, just keep training on it.  It will take some time for the
> > Debian list headers to become more neutral, thus allowing the spam
> > tokens to shine through.  You can try recursive training to speed
> > > up the process -- that is, train the spam, test its spamicity, and
> > train again if it's still too low, then repeat.  Bfproxy has an
> > option to do this automatically for you.
> >
> > Tom
> 
> Hi Tom. Ok I see where you're going. So would it be helpful if I
> trained bogofilter with a load of genuine Debian ham mails, so as to
> compare the ham from spam?
> 
> I suppose also it would be a good idea to upgrade bogofilter. I've
> put this off as I didn't want to mess something up that is working
> ok, that is apart from the mailing list stuff.
> 
> Thanks for the reply.
> 
> Nigel.
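
The recursive training Tom describes above (train, test, train again
if the score is still too low) can be sketched in Python as follows.
This is only an illustration: score() and train_spam() are hypothetical
callables standing in for whatever invokes the filter, ToyFilter is a
made-up stand-in for a real classifier, and the cutoff and round limit
are assumed values, not bogofilter settings.

```python
from typing import Callable


def train_until_spam(message: str,
                     score: Callable[[str], float],
                     train_spam: Callable[[str], None],
                     cutoff: float = 0.9,
                     max_rounds: int = 5) -> int:
    """Train on `message` as spam until its score reaches `cutoff`.

    Returns the number of training rounds performed. Stops after
    `max_rounds` even if the score is still below the cutoff.
    """
    for rounds in range(1, max_rounds + 1):
        train_spam(message)              # train the spam
        if score(message) >= cutoff:     # test its spamicity
            return rounds                # high enough: stop
    return max_rounds                    # gave up at the round limit


class ToyFilter:
    """Made-up classifier: each training round raises the score by 0.2."""

    def __init__(self) -> None:
        self.times_trained = 0

    def train_spam(self, message: str) -> None:
        self.times_trained += 1

    def score(self, message: str) -> float:
        return min(1.0, 0.4 + 0.2 * self.times_trained)


toy = ToyFilter()
rounds_needed = train_until_spam("example spam", toy.score, toy.train_spam)
```

The round limit is there so that a message which never reaches the
cutoff does not get trained on indefinitely.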

Hello Nigel,

Upgrading bogofilter won't make a significant difference.
Bogofilter has been very stable and virtually bug-free since the 1.0
release.  Changes to the algorithm and parsing have been minimal.

The problem you're having with spam on the Debian list being
classified as ham is that bogofilter has been trained to treat
Debian messages as ham.  If you score a spam message using
bogofilter's '-vvv' option, you'll see the scores assigned to the
various tokens.  Most likely the Debian mailing list tokens score very
low (meaning "definitely ham") and outweigh the "definitely spam"
tokens.  Training with additional Debian list ham will weight the
messages even more towards ham, so it would make the problem worse.
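
To make the effect concrete, here is a rough Python sketch of how many
hammy header tokens can swamp a few spammy body tokens.  It uses a
simplified form of Robinson's chi-square (Fisher) combining and leaves
out bogofilter's tuning parameters, so it is not bogofilter's actual
code, and the token scores are invented for illustration.

```python
import math
from typing import List


def chi2_survival(x: float, dof: int) -> float:
    """P(X > x) for a chi-square variable with an even number `dof`
    of degrees of freedom, using the closed-form series."""
    assert dof % 2 == 0
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_spamicity(token_scores: List[float]) -> float:
    """Combine per-token scores (each strictly between 0 and 1,
    higher meaning spammier) into one message score in [0, 1]."""
    n = len(token_scores)
    ln_f = sum(math.log(f) for f in token_scores)
    ln_not_f = sum(math.log(1.0 - f) for f in token_scores)
    h = chi2_survival(-2.0 * ln_f, 2 * n)      # near 1 when tokens look spammy
    s = chi2_survival(-2.0 * ln_not_f, 2 * n)  # near 1 when tokens look hammy
    return (1.0 + h - s) / 2.0


# Invented scores: 20 list-header tokens learned as ham,
# 5 body tokens learned as spam.
list_header_scores = [0.01] * 20
spam_body_scores = [0.99] * 5

with_headers = fisher_spamicity(list_header_scores + spam_body_scores)
body_only = fisher_spamicity(spam_body_scores)
```

With these made-up numbers, with_headers comes out well below 0.5
while body_only is very close to 1.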

What you _could_ do is create an ignore list with headers from the
Debian list.  This would eliminate those tokens from the scoring,
effectively telling bogofilter to score using only body tokens.
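
Conceptually, an ignore list just removes the listed tokens before the
message is scored.  A minimal Python sketch of that idea follows.  The
token names and scores are invented, and drop_ignored() is a
hypothetical helper for illustration, not part of bogofilter.

```python
from typing import Dict, Set


def drop_ignored(token_scores: Dict[str, float],
                 ignore: Set[str]) -> Dict[str, float]:
    """Return only the tokens that are not on the ignore list."""
    return {tok: sc for tok, sc in token_scores.items() if tok not in ignore}


# Invented example tokens: two list-header tokens and two body tokens.
message_tokens = {
    "head:lists.debian.org": 0.01,
    "head:debian-user": 0.01,
    "cheap": 0.97,
    "pills": 0.99,
}

# Hypothetical ignore list holding the Debian list header tokens.
ignore_list = {"head:lists.debian.org", "head:debian-user"}

scored_tokens = drop_ignored(message_tokens, ignore_list)
```

Only the body tokens are left in scored_tokens, so the list headers no
longer pull the result towards ham.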

HTH,

David


