relson at osagesoftware.com
Sat Oct 18 16:38:32 EDT 2008
On Sat, 18 Oct 2008 12:09:24 -0800
> If I repeatedly use the same set of initial spam messages to train
> bogofilter, will that cause it to work less well?
> I have a spam corpus to which I continually add messages. Then
> using bogominitrain.pl, occasionally retrain.
> I'm wondering if this could cause a problem.
> My concern comes from looking at the bogosity header: both my "ham"
> messages and my "spam" messages get a spamicity of .52000.
> Thanks for any guidance.
> Mike B.
We recommend against training over and over with the same messages as
it biases the wordlist. Training with ham and spam yields a wordlist
that indicates how often individual words (tokens) occur in ham and in
spam. For a simplified example: if "xyzw" occurs in 10% of your ham
and in 20% of your spam, then a message with "xyzw" is twice as likely
to be spam as it is to be ham. If you keep training with the same spam,
its tokens get counted again each time and the results are skewed.
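
To make the skew concrete, here is a minimal Python sketch of the
simplified example. It is not bogofilter's actual scoring algorithm;
the Wordlist class, its methods, and the sample messages are all
hypothetical and exist only for illustration. It assumes a subset of
the spam corpus (the messages containing "xyzw") is registered again
on later training runs.

```python
from collections import Counter


class Wordlist:
    """Toy wordlist (hypothetical, not bogofilter's on-disk format)."""

    def __init__(self):
        self.msgs = {"ham": 0, "spam": 0}
        self.tokens = {"ham": Counter(), "spam": Counter()}

    def train(self, kind, message):
        # Count each distinct token once per registered message.
        self.msgs[kind] += 1
        self.tokens[kind].update(set(message.split()))

    def rate(self, kind, token):
        # Fraction of registered messages of this kind containing token.
        return self.tokens[kind][token] / self.msgs[kind]

    def spam_to_ham_ratio(self, token):
        return self.rate("spam", token) / self.rate("ham", token)


def demo():
    wl = Wordlist()
    ham = ["xyzw hello"] * 1 + ["hello world"] * 9   # "xyzw" in 10% of ham
    spam = ["xyzw buy"] * 2 + ["buy now"] * 8        # "xyzw" in 20% of spam
    for m in ham:
        wl.train("ham", m)
    for m in spam:
        wl.train("spam", m)
    before = wl.spam_to_ham_ratio("xyzw")

    # Assumption: the same two "xyzw" spam messages are registered
    # again on three later training runs.
    for _ in range(3):
        for m in spam[:2]:
            wl.train("spam", m)
    after = wl.spam_to_ham_ratio("xyzw")
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(f"ratio before retraining: {before:.2f}")
    print(f"ratio after retraining:  {after:.2f}")
```

The ratio for "xyzw" grows after the repeated registrations even though
no new mail was seen, which is the kind of bias described above.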
If you're seeing 0.52000 scores for both ham and spam, then something
is wrong.
Bogofilter has flags that will show you how/why it's scoring a message
as ham or spam. Look in the FAQ for the writeup on "-vv" and "-vvv",
then give these flags a try with sample ham and spam messages to see
what you learn. Also, bogoutil has a "-p" flag that will show the ham
and spam scores of tokens passed to it. That is likely to be helpful.