Nearly everything is either 0.500000 or 1.000000

Jochem Huhmann joh at gmx.net
Wed Aug 15 14:48:33 CEST 2007


Hi,

I've been using Bogofilter for a few years now and I'm quite happy with it.
I receive lots of spam, have had only two or three false positives and not
too many false negatives.

But right now, after batch training Bogofilter with about 15000 spams  
filtered by other means, I've observed a strange thing: *All* mails  
get a bogosity of either 0.000000 (very rare, always ham mails),  
0.500000 (nearly all good mails and all false negatives) or 1.000000  
(all spam mails). This was certainly not the case before the last
training batch: I had good mails always at or very near 0, spam at or
near 1 and false negatives somewhere in between. As it should be.

Baffling. I decided to have a look at some mails with "bogofilter -vvv":
It looks as if headers found identically in all mails (spam and
ham) drag the bogosity heavily towards "spammy", so that even
otherwise clearly good mails go to 0.500000. These headers are
inserted by the last two SMTP servers on the way in, so they are
present in all mails. This in turn led me to look at the message counts:

$ bogoutil -p .bogofilter/wordlist.db .MSG_COUNT
                                  spam    good    Fisher
.MSG_COUNT                     126749    5761  0.500000

Eek! This explains it somehow: there is more than twenty times as much
spam as ham, and so identical tokens found in all mails will drag
the bogosity up. But wait: I have received *many* more good mails
than just 5761! I have about 25000 mails lying around right now (and I
don't keep everything).
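
A rough sketch of the arithmetic behind that suspicion, assuming a Robinson-style per-token score (the token's spam frequency relative to its ham frequency, each normalised by the message counts, then smoothed). The function name, the robs/robx values and the example token counts are assumptions made up for illustration; only the two message counts come from the output above.

```python
# Message counts from the .MSG_COUNT line above.
N_SPAM = 126749
N_GOOD = 5761


def token_score(spam_count, good_count, n_spam=N_SPAM, n_good=N_GOOD,
                robs=0.0178, robx=0.52):
    """Smoothed spamminess of one token (Robinson-style, assumed formula).

    robs and robx are assumed smoothing parameters, not values taken
    from the poster's configuration.
    """
    n = spam_count + good_count
    if n == 0:
        return robx  # unseen token: fall back to the prior
    spam_freq = spam_count / n_spam
    good_freq = good_count / n_good
    p = spam_freq / (spam_freq + good_freq)
    return (robs * robx + n * p) / (robs + n)


ratio = N_SPAM / N_GOOD  # about 22 spam messages per registered ham

# Hypothetical header token: seen in all 15000 batch-trained spams,
# but in only 100 registered hams because most ham was never registered.
header_score = token_score(15000, 100)

# The same token if it had been registered in every message of both kinds.
neutral_score = token_score(N_SPAM, N_GOOD)
```

Under these assumptions a token registered in every spam and every ham comes out neutral; it only looks spammy when its ham registrations lag behind its spam registrations.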

Hmm. Then I remembered that I had set thresh_update=0.01 in
~/.bogofilter.cf, which led to clearly good or spammy mails not being
registered at all, and since good mails were nearly always at 0, they
didn't register anymore. Together with my yearly cleanup of the database,
which removes old tokens not used for a while, *and* the recent
batch training with 15000 spams, I now have a database full of spam
tokens and quite void of ham tokens...
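
The effect of thresh_update can be sketched roughly as follows. This is only an illustration under one assumption: an auto-updated message is skipped when its score lies within thresh_update of 0.0 or 1.0. The helper names and the sample scores are invented, and the handling of the exact boundary is a guess.

```python
def should_register(score, thresh_update=0.01):
    """Return True if an auto-updated message would be registered.

    Assumption: a message is skipped when its score lies within
    thresh_update of 0.0 or 1.0; whether the boundary itself is
    inclusive is a guess.
    """
    return thresh_update < score < 1.0 - thresh_update


def count_registered(scores, thresh_update=0.01):
    """Count how many of the given message scores would be registered."""
    return sum(1 for s in scores if should_register(s, thresh_update))


# Made-up scores: mail the filter is already sure about, plus a
# handful of unsure messages.
confident_ham = [0.000000, 0.000012, 0.003]
unsure = [0.42, 0.5, 0.73]
confident_spam = [0.995, 1.000000]

registered = count_registered(confident_ham + unsure + confident_spam)
```

Only the three unsure messages get registered here. Explicit batch training presumably bypasses this check, which would let the spam side keep growing while confidently scored ham adds nothing.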

What to do now? Just wait for Bogofilter to catch up on tokens from  
good mail (since they are far away from 0 now they will be registered  
again)? Toss away the database and completely retrain with all good  
mail and an equal amount of spam (is 1:1 a good idea anyway?)?  
Manually remove all the common header tokens from the database to  
make the actually meaningful tokens stand out more?

I have to say that while good mails are now at 0.500000 there's still  
a good distance to the spam cutoff of 0.99, so I don't really fear  
false positives. But seeing false negatives (which are clearly
spammy to the eye) *and* perfectly good mail registering exactly
the same bogosity makes me somewhat uneasy.


	Jochem

-- 
When the revolution comes, I will be shot by both sides.





