Training bogofilter

David Relson relson at osagesoftware.com
Mon Sep 8 13:22:11 CEST 2003


On Mon, 8 Sep 2003 13:28:22 +1000
"Mike Robinson" <BlackMagic at computer.org> wrote:

> Hi David,
> 
> I've got qmail running on my mail server and Outlook Express (OE)
> running on my client machine. I've set up a mail account on my server
> called spam, and I'm forwarding to that account a few hundred spams
> that I've collected in OE on my client machine. I'm wondering if this
> will cloud the issue when I train bogofilter using the messages in the
> spam account, because each message now has the following opening
> lines:
> 
> Return-Path: <BlackMagic at computer.org>
> Delivered-To: spam at mydomain.com.au
> 
> Will these addresses be tokenised and treated as potential spam? I
> would appreciate your advice on this.
> 
> MJR

Hi Mike,

As a quick test, put those two lines in a file, say "msg.tmp".  Then run
"bogolexer -p < msg.tmp".  This will show you how bogofilter parses
those header lines and what tokens are generated in the process. 
(Knowing the process, I can tell you that there will be 5 tokens.)
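
A minimal sketch of that quick test, for anyone following along. The two addresses are hypothetical placeholders (the list archive obscures the real ones), so the tokens and their count may differ from what the real headers produce. The guard around bogolexer is only there so the snippet fails politely on a machine without bogofilter installed.

```shell
# Write the two header lines to a scratch file.
# The addresses are placeholders, not the ones from the original message.
cat > msg.tmp <<'EOF'
Return-Path: <sender@example.org>
Delivered-To: spam@example.com
EOF

# Show how the lexer tokenises those lines, if bogolexer is available.
if command -v bogolexer >/dev/null 2>&1; then
    bogolexer -p < msg.tmp
else
    echo "bogolexer not found; install bogofilter to run this step" >&2
fi
```

Substitute your own Return-Path and Delivered-To lines to see the tokens that matter for your setup.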

When you see the 5 tokens, ask yourself "Will those tokens also appear
in the ham wordlist? How often?"  Also ask "Will those tokens be in real
mail or not?"

If the tokens will only be in your training messages (and not in real
email), there is _no_ problem.  If you've trained with ham, some of the
tokens will be in both wordlists.  To see how they score, try the command
"bogoutil -p your_bogofilter_dir token1 token2 token3".

Also, remember that bogofilter looks at _all_ the tokens in a message. 
So, even if all 5 of the tokens are highly spammish, they are unlikely
to have much effect on a message's spam score.  (Look at the discussion
of the "-v" options in the FAQ for info on how to see why bogofilter
scores a message as it does).
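
As a rough illustration of the "-v" check, assuming a trained wordlist already exists in bogofilter's default location: the file name "sample.msg" and its contents are made up for the example, and the FAQ remains the place to look for what the extra levels of "-v" print.

```shell
# Build a small test message; the name and contents are hypothetical.
cat > sample.msg <<'EOF'
Return-Path: <sender@example.org>
Delivered-To: spam@example.com
Subject: test message

Just a short body to give the lexer a few more tokens.
EOF

# Classify the message verbosely, if bogofilter is available.
# bogofilter reports its verdict through the exit status, so a
# nonzero status here is not necessarily a failure.
if command -v bogofilter >/dev/null 2>&1; then
    bogofilter -v < sample.msg || true
else
    echo "bogofilter not found; skipping the scoring step" >&2
fi
```

Running this on one of the forwarded spams, rather than the toy message, shows how little the extra header tokens move the final score.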

Odds are you'll have no trouble :-)

David

CC: bogofilter mailing list



