Training bogofilter with Spamassassin collected spam.

David Relson relson at osagesoftware.com
Mon Aug 9 14:14:44 CEST 2004


On 09 Aug 2004 07:00:55 -0400
Tom Anderson wrote:

> On Mon, 2004-08-09 at 06:50, Tom Anderson wrote:
> > On Sun, 2004-08-08 at 23:56, Christian Dysthe wrote:
> > > I have a large mbox with spam collected by Spamassassin. All this
> > > mail has  been altered like Spamassassin does it: Putting a spam
> > > warning text in the  body of the mail, and move the spam content
> > > to an attachment. Will it  cause any problems using this mbox to
> > > train Bogofilter?
> > 
> > Only that the SA header will be considered spammy.  If you only
> > train spam, but future hams also have a similar header, then you may
> > get false positives.  If you're not going to be using SA concurrent
> > with bogofilter in the future, then there shouldn't be any concern
> > with those headers being spammy, as nobody would intentionally add
> > spammy headers. Alternatively, to remove any possible problems, you
> > could run a bash script to strip out those headers before training.
> 
> Sorry, I misread your original message.  Not headers, the body, you're
> saying...  Yeah, that'll cause a problem.  Bogofilter doesn't look at
> attachment content.  You'll have to create (or find on the internet) a
> Spamassassin reverser script that will strip out the MIME stuff and
> just leave the original spam portion intact.
> 
> Tom

Christian,

Having the original mail as an attachment is helpful.  Bogofilter ships
with a script named mime.get.rfc822.  Its purpose is to extract an
attached email so that the original email is available to pass to
bogofilter (or wherever).

Test the script to confirm it does what you want.  It may need tweaking
if SA does something odd.

Likely, you'll need to do something like:

    for MSG in dir/* ; do
        mime.get.rfc822 < "$MSG" | bogofilter -s
    done
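
A slightly more defensive take on the loop above, as a rough sketch only.
It assumes one message per file in the directory, so a single mbox would
first need splitting into separate files (formail -s from procmail is one
way to do that).  The train_dir function and the EXTRACT and TRAIN
variables are names made up for this illustration, not anything that
ships with bogofilter; they exist only so the loop can be dry-run with
harmless commands before touching the real wordlist.

```shell
#!/bin/sh
# Hypothetical override points (not part of bogofilter itself).
# Defaults are the two commands from the loop above.
EXTRACT=${EXTRACT:-mime.get.rfc822}
TRAIN=${TRAIN:-bogofilter -s}

# train_dir DIR: feed every regular file in DIR through
# $EXTRACT and then $TRAIN, and report how many files were processed.
train_dir() {
    count=0
    for msg in "$1"/*; do
        # Skip subdirectories and the literal pattern of an empty directory.
        [ -f "$msg" ] || continue
        # Intentionally unquoted so "bogofilter -s" splits into command + flag.
        $EXTRACT < "$msg" | $TRAIN
        count=$((count + 1))
    done
    echo "processed $count message(s)"
}
```

To try it without registering anything, set EXTRACT=cat and TRAIN=cat and
call train_dir on a small test directory; once the extracted messages
look right, leave the defaults in place and run train_dir dir.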

Let me know how it goes!

HTH,

David


