Tuning bogofilter
David Relson
relson at osagesoftware.com
Sat Oct 18 22:38:32 CEST 2008
On Sat, 18 Oct 2008 12:09:24 -0800
barsalou wrote:
> If I repeatedly use the same set of initial spam messages to train
> bogofilter, will that cause it to work less well?
>
> I have a spam corpus to which I continually add messages, and I
> occasionally retrain using bogominitrain.pl.
>
> I'm wondering if this could cause a problem.
>
> My concern is born out of looking at the bogosity header and that
> both my "ham" messages and "spam" messages get a spamicity of .52000
>
> Thanks for any guidance.
>
> Mike B.
H'lo Mike,
We recommend against training over and over with the same messages, as
it biases the wordlist. Training with ham and spam yields a wordlist
that records how often individual words (tokens) occur in ham and in
spam. For a simplified example: if "xyzw" occurs in 10% of your ham
and in 20% of your spam, then (given equal amounts of ham and spam) a
message with "xyzw" is twice as likely to be spam as it is to be ham.
If you keep training with the same spam, the tokens in those messages
are counted again each time, which skews the results.
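
To make the skew concrete, here is a small Python sketch of the simplified
"xyzw" example. It is only an illustration of the counting idea and is not
bogofilter's actual scoring code; the corpus sizes, the split into an
"initial" and a "later" spam set, and the helper names train() and
fraction() are all assumptions made up for this example.

```python
from collections import Counter

def train(messages):
    """Return (message_count, per-token message counts) for a list of token sets."""
    counts = Counter()
    for tokens in messages:
        # Count each token at most once per message.
        counts.update(set(tokens))
    return len(messages), counts

def fraction(token, trained):
    """Fraction of trained messages that contain the token."""
    total, counts = trained
    return counts[token] / total

# Hypothetical corpus: "xyzw" is in 10 of 100 ham and 20 of 100 spam.
ham = [{"xyzw"}] * 10 + [{"other"}] * 90
initial_spam = [{"xyzw"}] * 16 + [{"other"}] * 24  # the set that gets re-trained
later_spam = [{"xyzw"}] * 4 + [{"other"}] * 56     # added later, trained once

ham_db = train(ham)
spam_once = train(initial_spam + later_spam)
# Same messages, but the initial set has been registered four times.
spam_repeated = train(initial_spam * 4 + later_spam)

ratio_once = fraction("xyzw", spam_once) / fraction("xyzw", ham_db)
ratio_repeated = fraction("xyzw", spam_repeated) / fraction("xyzw", ham_db)

print("trained once:       spam/ham ratio =", round(ratio_once, 2))
print("initial set x4:     spam/ham ratio =", round(ratio_repeated, 2))
```

With each message trained once, "xyzw" looks twice as common in spam as in
ham. After the initial set has been registered four times, the same token
looks roughly three times as common in spam, even though no new mail has
been seen.
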
If you're seeing 0.52000 scores for both ham and spam, then there's
something wrong.
Bogofilter has flags that will show you how/why it's scoring a message
as ham or spam. Look in the FAQ for the writeup on "-vv" and "-vvv",
then give these flags a try with sample ham and spam messages to see
what you learn. Also, bogoutil has a "-p" flag that will show the ham
and spam scores of tokens passed to it. That is likely to be helpful.
HTH,
David