barjunk at attglobal.net
Mon Oct 20 00:21:12 EDT 2008
> Message: 2
> Date: Sat, 18 Oct 2008 16:38:32 -0400
> From: David Relson <relson at osagesoftware.com>
> Subject: Re: Tuning bogofilter
> Cc: bogofilter at bogofilter.org
> Message-ID: <20081018163832.327b1f22 at osage.osagesoftware.com>
> Content-Type: text/plain; charset=US-ASCII
> On Sat, 18 Oct 2008 12:09:24 -0800
> barsalou wrote:
>> If I repeatedly use the same set of initial spam messages to train
>> bogofilter, will that cause it to work less well?
>> I have a spam corpus to which I continually add messages. Then
>> using bogominitrain.pl, occasionally retrain.
>> I'm wondering if this could cause a problem.
>> My concern is born out of looking at the bogosity header and that
>> both my "ham" messages and "spam" messages get a spamicity of .52000
>> Thanks for any guidance.
>> Mike B.
> H'lo Mike,
> We recommend against training over and over with the same messages as
> it biases the wordlist. Training with ham and spam yields a wordlist
> that indicates how often individual words (tokens) occur in ham and in
> spam. For a simplified example: if "xyzw" occurs in 10% of your ham
> and in 20% of your spam, then a message with "xyzw" is twice as likely
> to be spam as it is to be ham. If you keep training with the same spam
> you skew the results -- which is not recommended.
> If you're seeing 0.52000 scores for both ham and spam, then there's
> something wrong.
> Bogofilter has flags that will show you how/why it's scoring a message
> as ham or spam. Look in the FAQ for the writeup on "-vv" and "-vvv",
> then give these flags a try with sample ham and spam messages to see
> what you learn. Also, bogoutil has a "-p" flag that will show the ham
> and spam scores of tokens passed to it. That is likely to be helpful.
I cleared my wordlist and retrained it from my corpus. I also found
that the newer version I was using puts the word 'Spam' instead of
'Yes' in the X-Bogosity line, so procmail wasn't doing the sorting.
Now things are working more like I'd expect them to.
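For anyone hitting the same 'Spam' versus 'Yes' mismatch, here is a minimal sketch of a check that accepts either wording. It is written in Python rather than as a procmail recipe, the function name is_spam is made up for illustration, and it assumes the classification is the first comma-separated field of the X-Bogosity line (as in "X-Bogosity: Spam, tests=bogofilter, spamicity=..."), which may differ if the header format has been customized.

```python
from email import message_from_string

# Labels treated as spam: "Yes" is the older wording, "Spam" the newer one.
SPAM_LABELS = {"spam", "yes"}


def is_spam(raw_message: str) -> bool:
    """Return True if the X-Bogosity header classifies the message as spam.

    Assumes the classification is the first comma-separated field of the
    header value; a message without the header is treated as not spam.
    """
    msg = message_from_string(raw_message)
    header = msg.get("X-Bogosity", "")
    label = header.split(",", 1)[0].strip().lower()
    return label in SPAM_LABELS
```

Matching on both labels keeps the sorting working across an upgrade that changes the wording.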
You said using the same spam corpus will skew the results...but it
isn't clear to me in what way it will do that. I assume you mean that
it will mark words as being more spammy than they should be. Is that right?
Thanks for your response.
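
One way to picture the skew, as a rough sketch only: the code below is not bogofilter's actual scoring, just the simplified "share of messages containing a token" arithmetic from the xyzw example quoted above. The corpora, the helper token_rate, and the choice of three training passes are all made up for illustration. It assumes an older batch of spam gets registered again on each retrain while newer spam is registered once.

```python
from fractions import Fraction


def token_rate(batches, token):
    """Share of trained messages that contain `token`.

    `batches` is a list of (messages, times_trained) pairs, where each
    message is a set of tokens. A batch trained N times is counted N times.
    """
    hits = 0
    total = 0
    for messages, times in batches:
        hits += times * sum(token in message for message in messages)
        total += times * len(messages)
    return Fraction(hits, total)


# Hypothetical corpora: "xyzw" is in 2 of 20 ham messages (10%) and
# in 4 of 20 spam messages (20%), all 4 of them in the older batch.
ham = [{"xyzw", "meeting"}] * 2 + [{"meeting"}] * 18
old_spam = [{"xyzw", "offer"}] * 4 + [{"offer"}] * 6
new_spam = [{"pills"}] * 10

ham_rate = token_rate([(ham, 1)], "xyzw")
spam_rate_once = token_rate([(old_spam, 1), (new_spam, 1)], "xyzw")
spam_rate_repeated = token_rate([(old_spam, 3), (new_spam, 1)], "xyzw")

print(spam_rate_once / ham_rate)      # every message trained once
print(spam_rate_repeated / ham_rate)  # old batch trained three times
```

With every message trained once, "xyzw" looks twice as common in spam as in ham. With the old batch trained three times it looks three times as common, even though the mail itself has not changed. Under these assumptions, tokens from the repeated messages do end up looking more spammy than they should.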