Nearly everything is either 0.500000 or 1.000000

Sat Aug 18 10:42:11 CEST 2007

On 2007-08-15, at 17:58, Tom Anderson wrote:

> At first glance, it seems to me that one or two headers should not have
> that kind of effect.  Moving from 0 to 0.5 would require something else
> than all of a sudden having a few tokens slightly more spammy than
> before.  Are you classifying on the headers only?

No, I'm classifying on headers and body.

> Run a ham through
> with -vvv and see what all of the body tokens are contributing.

Already did that, but it didn't help me much... looks quite OK actually.

>
> As a quick solution, if it were me, I would just grab my entire archive
> of hams and run it through training once.

In the meantime I created a fresh database from all my ham and 450000 spam
mails -- bogofilter behaves more normally now, with good mail near 0,
most spam near 1 and false negatives in between. I still have spam
getting through with a bogosity of about 0.5, but these messages contain
large amounts of (hidden) text which drags the bogosity down, so
bogofilter seems to be doing its best.
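
To illustrate why a lot of hidden "hammy" text can pull a spam towards 0.5
rather than towards 0, here is a minimal sketch of Fisher-style chi-square
combining, the general approach bogofilter is usually described as using.
It is not bogofilter's actual code: the function names are made up, the
per-token probabilities (0.99 and 0.01) are invented for the example, and
details such as ignoring tokens close to 0.5 are left out.

```python
import math


def chi2_survival(x, dof):
    """P(X > x) for a chi-square variable with an even number of
    degrees of freedom (closed-form series, no SciPy needed)."""
    assert dof % 2 == 0
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1].

    token_probs are hypothetical values strictly between 0 and 1.
    """
    n = len(token_probs)
    # Evidence for spam: large when many tokens have p close to 1.
    spam = 1.0 - chi2_survival(
        -2.0 * sum(math.log(1.0 - p) for p in token_probs), 2 * n)
    # Evidence for ham: large when many tokens have p close to 0.
    ham = 1.0 - chi2_survival(
        -2.0 * sum(math.log(p) for p in token_probs), 2 * n)
    return (spam - ham + 1.0) / 2.0


if __name__ == "__main__":
    spammy = [0.99] * 50          # visible spam text
    hidden_ham = [0.01] * 50      # hidden innocuous-looking text
    print("spam only:       ", fisher_score(spammy))
    print("spam + hidden ham:", fisher_score(spammy + hidden_ham))
```

With strong evidence on both sides, the spam and ham indicators both
saturate and cancel each other out, so the combined score lands in the
middle instead of at either end.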

>
> BTW, this is why I never do batch training in the first place.  Just
> train on error and you should never have problems like this.

This batch *was* training on error: it consisted entirely of false
negatives caught by other filters running after bogofilter (in my MUA).

Thanks for replying,

	Jochem

-- 
When the revolution comes, I will be shot by both sides.