Nearly everything is either 0.500000 or 1.000000

Wed Aug 15 17:58:48 CEST 2007

At first glance, it seems to me that one or two headers should not have 
that kind of effect.  Moving from 0 to 0.5 would take more than a few 
tokens suddenly becoming slightly more spammy than before.  Are you 
classifying on the headers only?  Run a ham through with -vvv and see 
what all of the body tokens are contributing.
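
To put rough numbers on that intuition, here is a minimal sketch of 
Robinson-Fisher combining as it is generally described.  It is not 
bogofilter's actual code: the function names and the per-token 
probabilities are made up for illustration, and it leaves out settings 
such as robs, robx and min_dev.

```python
import math

def chi2_sf_even(x, df):
    """Chi-square survival function P(X > x) for an even number of
    degrees of freedom, using the exact finite series."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(probs):
    """Combine per-token spam probabilities (each strictly between
    0 and 1) into one score: 0 means ham, 1 means spam."""
    n = len(probs)
    # h stays near 1 only if the tokens look consistently spammy.
    h = chi2_sf_even(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # s stays near 1 only if the tokens look consistently hammy.
    s = chi2_sf_even(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + h - s) / 2.0

# Hypothetical token probabilities, not taken from any real wordlist.
ham_like = [0.01] * 20
print(fisher_score(ham_like))                # clearly ham
print(fisher_score(ham_like + [0.6, 0.6]))   # two mildly spammy tokens added
print(fisher_score(ham_like + [0.99] * 20))  # strong evidence both ways
```

Under these assumptions the first two scores stay at essentially 0, 
while the third lands at 0.5: the method returns 0.5 when strong hammy 
and strong spammy evidence cancel out, which takes many strongly spammy 
tokens, not a couple of mildly spammy ones.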

As a quick solution, if it were me, I would just grab my entire archive 
of hams and run it through training once.

BTW, this is why I never do batch training in the first place.  Just 
train on error and you should never have problems like this.
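
For anyone unsure what "train on error" means in practice, here is a 
small sketch of the idea.  ToyFilter is a hypothetical stand-in, not 
bogofilter's interface; with a real filter, classify() and register() 
would wrap whatever classification and registration calls you already 
use.

```python
from typing import Callable, Iterable, Tuple

def train_on_error(
    messages: Iterable[Tuple[str, bool]],
    classify: Callable[[str], bool],
    register: Callable[[str, bool], None],
) -> int:
    """Register only the messages the filter currently gets wrong.

    messages yields (text, is_spam) pairs with known labels.
    Returns the number of messages that were registered.
    """
    trained = 0
    for text, is_spam in messages:
        if classify(text) != is_spam:
            register(text, is_spam)
            trained += 1
    return trained

class ToyFilter:
    """Hypothetical stand-in for a real filter: counts words per class."""

    def __init__(self):
        self.spam = {}
        self.ham = {}

    def classify(self, text):
        # Call it spam only if spam word counts outweigh ham word counts.
        words = text.split()
        spam_hits = sum(self.spam.get(w, 0) for w in words)
        ham_hits = sum(self.ham.get(w, 0) for w in words)
        return spam_hits > ham_hits

    def register(self, text, is_spam):
        table = self.spam if is_spam else self.ham
        for w in text.split():
            table[w] = table.get(w, 0) + 1

corpus = [
    ("buy pills now", True),
    ("meeting at noon", False),
    ("buy pills cheap", True),
    ("lunch at noon", False),
]
toy = ToyFilter()
print(train_on_error(corpus, toy.classify, toy.register))
```

Only misclassified messages reach the database, so a pile of mail the 
filter already gets right cannot swamp the counts the way a one-sided 
batch can.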

Tom

Jochem Huhmann wrote:
> Hi,
> 
> I've been using Bogofilter for a few years now and I'm quite happy with it.
> I receive lots of spam, have had only two or three false positives and not
> too many false negatives.
> 
> But right now, after batch training Bogofilter with about 15000 spams  
> filtered by other means, I've observed a strange thing: *All* mails  
> get a bogosity of either 0.000000 (very rare, always ham mails),  
> 0.500000 (nearly all good mails and all false negatives) or 1.000000  
> (all spam mails). This was certainly not the case before the last  
> training batch, I had good mails always at or very near 0, spam at or  
> near 1 and false negatives somewhere in between. As it should be.
> 
> Baffling. I decided to have a look at some mails with "bogofilter -vvv":
> It looks as if headers found identically in all mails (spam and
> ham) drag the bogosity heavily towards "spammy", so that even
> otherwise clearly good mails go to 0.500000. These headers are
> inserted by the last two SMTP servers on the way in, so they are
> present in all mails. This in turn led me to look at the message counts:
> 
> $ bogoutil -p .bogofilter/wordlist.db .MSG_COUNT
>                                   spam    good    Fisher
> .MSG_COUNT                     126749    5761  0.500000
> 
> Eek! This explains it to some extent: there is more than twenty times as
> much spam as ham, so identical tokens found in all mails will drag
> the bogosity up. But wait: I have received *many* more good mails
> than just 5761! I have about 25000 mails lying around right now (and I
> don't keep everything).
> 
> Hmm. Then I remembered that I had set thresh_update=0.01 in
> ~/.bogofilter.cf, which meant that clearly good or spammy mails were not
> registered at all, and since good mails were nearly always at 0, they
> didn't register anymore. Together with me cleaning old, unused tokens
> out of the database once a year *and* the recent batch training with
> 15000 spams, I now have a database full of spam tokens and quite void
> of ham tokens...
> 
> What to do now? Just wait for Bogofilter to catch up on tokens from  
> good mail (since they are far away from 0 now they will be registered  
> again)? Toss away the database and completely retrain with all good  
> mail and an equal amount of spam (is 1:1 a good idea anyway?)?  
> Manually remove all the common header tokens from the database to  
> make the actually meaningful tokens stand out more?
> 
> I have to say that while good mails are now at 0.500000 there's still  
> a good distance to the spam cutoff of 0.99, so I don't really fear  
> false positives. But seeing false negatives (which are clearly
> spammy to the eye) *and* perfectly good mail registering exactly
> the same bogosity makes me somewhat uneasy.
> 
> 
> 	Jochem
>