Tuning bogofilter

Sat Oct 18 22:38:32 CEST 2008

On Sat, 18 Oct 2008 12:09:24 -0800
barsalou wrote:

> If I repeatedly use the same set of initial spam messages to train  
> bogofilter, will that cause it to work less well?
> 
> I have a spam corpus to which I continually add messages.  Then
> using bogominitrain.pl, occasionally retrain.
> 
> I'm wondering if this could cause a problem.
> 
> My concern is born out of looking at the bogosity header and that
> both my "ham" messages and "spam" messages get a spamicity of .52000
> 
> Thanks for any guidance.
> 
> Mike B.

H'lo Mike,

We recommend against training over and over with the same messages, as
it biases the wordlist.  Training with ham and spam yields a wordlist
that records how often individual words (tokens) occur in ham and in
spam.  A simplified example:  if "xyzw" occurs in 10% of your ham and
in 20% of your spam, then a message containing "xyzw" is twice as
likely to be spam as ham (assuming you receive equal amounts of each).
If you keep retraining with the same spam, the tokens in those messages
are counted again each time and end up looking more spammy than they
really are.
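
As a rough sketch of the arithmetic above, here is a toy calculation in
Python.  It is not bogofilter's actual scoring code, and the message
counts and the helper name spam_to_ham_ratio are made up for
illustration.  It assumes the initial 500 spam messages are registered
three extra times while the newer 500 are registered once.

```python
# Toy model of the "xyzw" example; NOT bogofilter's real scoring code.
# All counts below are invented for illustration.

def spam_to_ham_ratio(ham_hits, ham_msgs, spam_hits, spam_msgs):
    """How many times more often a token shows up in spam than in ham."""
    ham_rate = ham_hits / ham_msgs
    spam_rate = spam_hits / spam_msgs
    return spam_rate / ham_rate

# Honest training: "xyzw" is in 100 of 1000 ham (10%) and 200 of 1000 spam (20%).
honest = spam_to_ham_ratio(100, 1000, 200, 1000)

# Suppose the initial 500 spam messages hold 180 of those 200 hits and are
# registered 3 extra times, while the newer 500 messages are registered once.
extra_rounds = 3
skewed_spam_hits = 200 + extra_rounds * 180    # 740
skewed_spam_msgs = 1000 + extra_rounds * 500   # 2500
skewed = spam_to_ham_ratio(100, 1000, skewed_spam_hits, skewed_spam_msgs)

print(f"honest ratio: {honest:.2f}")   # prints "honest ratio: 2.00"
print(f"skewed ratio: {skewed:.2f}")   # prints "skewed ratio: 2.96"
```

The token's real frequency in spam has not changed, but the wordlist
now claims it is nearly three times as common in spam as in ham.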

If you're seeing 0.52000 for both ham and spam, then something is
wrong:  with a properly trained wordlist, the two should score very
differently.

Bogofilter has flags that will show you how and why it's scoring a
message as ham or spam.  Look in the FAQ for the writeup on "-vv" and
"-vvv", then try these flags on sample ham and spam messages to see
what you learn.  Also, bogoutil has a "-p" flag that will show the ham
and spam scores of tokens passed to it.  That is likely to be helpful.
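
A minimal sketch of driving those checks from Python follows.  The
wordlist path (~/.bogofilter/wordlist.db), the sample file names, and
the helper functions are assumptions for illustration, as is passing
tokens to bogoutil as trailing arguments; check the man pages for your
version before relying on it.

```python
import os
import shutil
import subprocess

# Assumed wordlist location; adjust if yours lives elsewhere.
WORDLIST = os.path.expanduser("~/.bogofilter/wordlist.db")

def explain_message_cmd(verbosity=3):
    """Command that scores a message from stdin, e.g. bogofilter -vvv."""
    return ["bogofilter", "-" + "v" * verbosity]

def token_scores_cmd(tokens, wordlist=WORDLIST):
    """Command that asks bogoutil -p about the given tokens (assumed usage)."""
    return ["bogoutil", "-p", wordlist, *tokens]

def run(cmd, stdin_path=None):
    """Run cmd, feeding it the file at stdin_path if given; return stdout text."""
    data = b""
    if stdin_path is not None:
        with open(stdin_path, "rb") as handle:
            data = handle.read()
    # bogofilter's exit status encodes the classification, so a nonzero
    # status is not treated as a failure here.
    result = subprocess.run(cmd, input=data, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")

if __name__ == "__main__" and shutil.which("bogofilter") and shutil.which("bogoutil"):
    # Hypothetical sample files: one known ham and one known spam message.
    for sample in ("sample-ham.txt", "sample-spam.txt"):
        if os.path.exists(sample):
            print(run(explain_message_cmd(), sample))
    print(run(token_scores_cmd(["xyzw"])))
```

Comparing the verbose output for a known ham and a known spam message
should show whether any tokens are actually being found in the
wordlist.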

HTH,

David