Re Tuning Bogofilter

Thu Oct 23 01:22:20 CEST 2008

Message: 1
Date: Mon, 20 Oct 2008 21:18:47 -0400
From: Tom Anderson <tanderson at orderamidchaos.com>
Subject: Re: Tuning bogofilter
To: bf-users <bogofilter at bogofilter.org>
Message-ID: <48FD2DF7.9010303 at orderamidchaos.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

David Relson wrote:

We recommend against training over and over with the same messages as
it biases the wordlist.  Training with ham and spam yields a wordlist
that indicates how often individual words (tokens) occur in ham and in
spam.  For a simplified example:  if "xyzw" occurs in 10% of your ham
and in 20% of your spam, then a message with "xyzw" is twice as likely
to be spam as it is to be ham.  If you keep training with the same spam
you skew the results -- which is not recommended.

I don't really see biasing a spam message as spam to be particularly
problematic.  If indeed a particular word ought to be hammier, then it
will become so in the course of training your hams.  My experience has
been that sometimes you don't receive enough spams to make some tokens
spammy enough, and I therefore train these spams multiple times until
bogofilter recognizes them appropriately as spams.  Otherwise, I will
keep receiving them as false negatives.  For instance, if "xyzw" has
only occurred twice, but it is absolutely 100% always spammy and you
never ever want to see it again, then just keep training on that spam
until bogofilter recognizes it as such.  This is as if you have received
the spam many times, but without the inconvenience of actually having
done so.  When you later train a ham message which contains another
token "abcd" which may have appeared alongside "xyzw" and is now
spammier than it should have been, "abcd" will become hammier while
"xyzw" remains very spammy.  In the end, that is the result you want.

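To put rough numbers on the exchange above, here is a minimal Python
sketch. It assumes a Robinson-style smoothed token score,
f = (s*x + n*p) / (s + n), and is not bogofilter's actual code. The
helper name token_score, the counts, REPEATS, and the values s = 1 and
x = 0.5 are all made up for illustration.

```python
def token_score(spam_hits, ham_hits, spam_total, ham_total, s=1.0, x=0.5):
    """Smoothed spamminess of one token. s and x are assumed, illustrative values."""
    n = spam_hits + ham_hits
    if n == 0:
        return x  # token never seen: fall back to the neutral prior
    b = spam_hits / spam_total  # fraction of spam containing the token
    g = ham_hits / ham_total    # fraction of ham containing the token
    p = b / (b + g)             # raw spamminess, before smoothing
    return (s * x + n * p) / (s + n)

# David's example: "xyzw" in 20% of spam and 10% of ham (s=0 turns smoothing off).
raw = token_score(20, 10, 100, 100, s=0.0)
print(f"raw xyzw: {raw:.3f}")  # 0.667, i.e. twice as likely spam as ham

# Tom's case: "xyzw" seen in only 2 spams and never in ham.
# "abcd" appears in the same spam but is otherwise neutral (5 spam, 5 ham).
REPEATS = 8  # hypothetical number of extra registrations of the same spam
xyzw_before = token_score(2, 0, 100, 100)
xyzw_after = token_score(2 + REPEATS, 0, 100 + REPEATS, 100)
abcd_before = token_score(5, 5, 100, 100)
abcd_skewed = token_score(5 + REPEATS, 5, 100 + REPEATS, 100)

# Later, 10 new hams containing "abcd" (but not "xyzw") are trained.
abcd_later = token_score(5 + REPEATS, 5 + 10, 100 + REPEATS, 100 + 10)
xyzw_later = token_score(2 + REPEATS, 0, 100 + REPEATS, 100 + 10)

print(f"xyzw: {xyzw_before:.3f} -> {xyzw_after:.3f} -> {xyzw_later:.3f}")
print(f"abcd: {abcd_before:.3f} -> {abcd_skewed:.3f} -> {abcd_later:.3f}")
```

Under these assumptions both points show up: repeated training pushes
"xyzw" toward spam as Tom wants, it also drags the bystander "abcd"
upward as David warns, and later ham training pulls "abcd" back down
while "xyzw" stays put.
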
As I see it, most of the English language should wind up essentially
neutral in your wordlist, with only the truly hammy and spammy words
standing out, like a whitelist and blacklist respectively.  If some
portion of the general language moves slightly hammy or spammy due to
"over-training" some particular emails, it shouldn't have a large effect
on the classification since it is largely only the trigger tokens which
will ultimately decide it.  If a message is so wishy-washy as to contain
no such trigger tokens which are obviously hammy or spammy, or perhaps
well-crafted enough to contain equal numbers of each, then it deserves
to be marked unsure so that you can manually determine it.
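
The "trigger token" argument can be sketched in Python as well. This is
a simplified illustration loosely modeled on Robinson's chi-square
(Fisher) combining, not bogofilter's implementation. The names chi2q and
classify, and the values chosen for min_dev, spam_cutoff and ham_cutoff,
are assumptions picked for the example.

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-square probability; valid for an even number of degrees of freedom."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def classify(scores, min_dev=0.375, spam_cutoff=0.99, ham_cutoff=0.45):
    """Combine per-token scores; tokens within min_dev of 0.5 are ignored."""
    # Keep only the strongly hammy or spammy tokens, clamped away from 0 and 1.
    strong = [min(max(f, 1e-9), 1 - 1e-9) for f in scores if abs(f - 0.5) >= min_dev]
    if not strong:
        return 0.5, "unsure"  # nothing decisive in the message
    dof = 2 * len(strong)
    spam_side = chi2q(-2.0 * sum(math.log(f) for f in strong), dof)
    ham_side = chi2q(-2.0 * sum(math.log(1.0 - f) for f in strong), dof)
    score = (1.0 + spam_side - ham_side) / 2.0
    if score >= spam_cutoff:
        return score, "spam"
    if score <= ham_cutoff:
        return score, "ham"
    return score, "unsure"

triggers = [0.99, 0.98, 0.97]
print(classify(triggers + [0.50, 0.50, 0.50]))  # neutral words at 0.5
print(classify(triggers + [0.60, 0.58, 0.62]))  # same words after a slight drift
print(classify([0.55, 0.45, 0.60]))             # no trigger tokens at all
print(classify([0.99, 0.01]))                   # one spammy and one hammy trigger
```

In this sketch the first two calls return the same score, because the
slightly drifted words still fall inside the ignored band around 0.5,
and the last two messages come out as unsure.
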

Tom

And in a sense, Tom, I did exactly as you suggested.  Using  
bogominitrain.pl does exactly that.

However, if I continued to use the corpus AND new spam e-mail, I'd be
concerned that I was skewing the results so that ham messages might
start to look like spam.

That was my main concern.

After thinking about what happened in my case: because I was on such
an old version, it made better sense for me to wipe my entire db and
retrain from my saved corpus.

Additionally, I had old procmail scripts in place that were causing
messages not to be filtered correctly.

So in the end, for me anyway, doing the initial training with  
bogominitrain.pl worked exactly how I'd expect it to.

Now I'll save messages that didn't get marked as spam and then
reprocess them with bogominitrain.pl.

If anyone can definitively say that re-running my old corpus using  
bogominitrain.pl won't affect my spam scoring in a negative way, then  
it will be OK to continue adding any new missed spam to my main corpus  
and continue using bogominitrain.pl to train the db.

Thanks for your post.

Mike B.
