Tuning bogofilter

Tue Oct 21 03:18:47 CEST 2008

David Relson wrote:
> We recommend against training over and over with the same messages as
> it biases the wordlist.  Training with ham and spam yields a wordlist
> that indicates how often individual words (tokens) occur in ham and in
> spam.  For a simplified example:  if "xyzw" occurs in 10% of your ham
> and in 20% of your spam, then a message with "xyzw" is twice as likely
> to be spam as it is to be ham.  If you keep training with the same spam
> you skew the results -- which is not recommended.

I don't really see biasing the wordlist toward a message that is in fact 
spam as particularly problematic.  If indeed a particular word ought to 
be hammier, then it 
will become so in the course of training your hams.  My experience has 
been that sometimes you don't receive enough spams to make some tokens 
spammy enough, and I therefore train these spams multiple times until 
bogofilter recognizes them appropriately as spams.  Otherwise, I will 
keep receiving them as false negatives.  For instance, if "xyzw" has 
only occurred twice, but it is absolutely 100% always spammy and you 
never ever want to see it again, then just keep training on that spam 
until bogofilter recognizes it as such.  This is as if you have received 
the spam many times, but without the inconvenience of actually having 
done so.  When you later train a ham message that contains another 
token "abcd", which may have appeared alongside "xyzw" and is now 
spammier than it should be, "abcd" will become hammier while "xyzw" 
remains very spammy.  In the end, that is the result you want.

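To make the "xyzw"/"abcd" argument concrete, here is a toy sketch in 
Python.  It is not bogofilter's real database or scoring formula: the 
ToyWordlist class, its method names, and the message counts are all 
made up for illustration, and spamicity() is just the simplified ratio 
from David's example.

```python
from collections import Counter


class ToyWordlist:
    """Hypothetical stand-in for a wordlist; not bogofilter's real database."""

    def __init__(self):
        self.msgs = {"spam": 0, "ham": 0}
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, tokens, kind, times=1):
        # Each training pass counts as one more message of that kind,
        # and each distinct token is counted once per message.
        for _ in range(times):
            self.msgs[kind] += 1
            self.counts[kind].update(set(tokens))

    def rate(self, token, kind):
        # Fraction of trained messages of this kind that contain the token.
        n = self.msgs[kind]
        return self.counts[kind][token] / n if n else 0.0

    def spamicity(self, token):
        # Simplified ratio from the quoted example: a token in 20% of spam
        # and 10% of ham scores 0.2 / (0.2 + 0.1), i.e. twice as likely spam.
        s, h = self.rate(token, "spam"), self.rate(token, "ham")
        return s / (s + h) if s + h else 0.5


wl = ToyWordlist()
wl.train(["meeting", "lunch"], "ham", times=10)
wl.train(["offer", "click"], "spam", times=4)
# The same spam trained five times, as if it had arrived five times.
wl.train(["xyzw", "abcd", "click"], "spam", times=5)

abcd_before = wl.spamicity("abcd")
# Later, ordinary hams that happen to contain "abcd" are trained.
wl.train(["abcd", "meeting"], "ham", times=3)
abcd_after = wl.spamicity("abcd")
xyzw_after = wl.spamicity("xyzw")
```

In this toy model "abcd" drops from fully spammy toward neutral once it 
shows up in trained ham, while "xyzw", which never appears in ham, stays 
at the top of the scale.
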
As I see it, most of the English language should wind up essentially 
neutral in your wordlist, with only the truly hammy and spammy words 
standing out, like a whitelist and blacklist respectively.  If some 
portion of the general language moves slightly hammy or spammy due to 
"over-training" some particular emails, it shouldn't have a large effect 
on the classification since it is largely only the trigger tokens which 
will ultimately decide it.  If a message is so wishy-washy as to contain 
no trigger tokens that are obviously hammy or spammy, or is perhaps 
well-crafted enough to contain equal numbers of each, then it deserves 
to be marked unsure so that you can classify it manually.
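
A rough sketch of that idea in Python, assuming a classifier that 
ignores tokens near neutral and averages the rest.  The function name, 
the min_dev and cutoff values, and the plain averaging are assumptions 
chosen for illustration; bogofilter's actual scoring is more involved.

```python
def classify(token_scores, min_dev=0.3, spam_cutoff=0.8, ham_cutoff=0.2):
    """Toy three-way classifier over per-token spamicities in [0, 1]."""
    # Only "trigger" tokens, those far enough from the neutral 0.5, count.
    triggers = [s for s in token_scores if abs(s - 0.5) >= min_dev]
    if not triggers:
        return "unsure"  # nothing decisive in the message
    mean = sum(triggers) / len(triggers)
    if mean >= spam_cutoff:
        return "spam"
    if mean <= ham_cutoff:
        return "ham"
    return "unsure"  # hammy and spammy triggers cancel out


# General vocabulary sitting at neutral, plus one strong spam trigger.
neutral = classify([0.50, 0.50, 0.50, 0.99])
# The same message after general vocabulary has drifted slightly spammy.
drifted = classify([0.55, 0.58, 0.53, 0.99])
# No trigger tokens at all, and equal numbers of each kind.
bland = classify([0.50, 0.52, 0.48])
balanced = classify([0.99, 0.01])
```

Under these assumptions the slight drift in the general vocabulary does 
not change the verdict, and the bland and balanced messages both come 
out unsure.
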

Tom