Nearly everything is either 0.500000 or 1.000000
tanderso at oac-design.com
Wed Aug 15 11:58:48 EDT 2007
At first glance, it seems to me that one or two headers should not have
that kind of effect. Moving from 0 to 0.5 would take more than a few
tokens suddenly becoming slightly more spammy than before. Are you
classifying on the headers only? Run a ham through with -vvv and see
what all of the body tokens are contributing.
As a quick solution, if it were me, I would just grab my entire archive
of hams and run it through training once.
BTW, this is why I never do batch training in the first place. Just
train on error and you should never have problems like this.
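For anyone unfamiliar with the term, train on error roughly means: classify each message first, and only register it if the verdict was wrong. A minimal sketch of that loop follows. The names train_on_error, classify, and register are hypothetical placeholders for whatever invokes the filter; they are not part of Bogofilter itself.

```python
from typing import Callable, Iterable, Tuple


def train_on_error(
    messages: Iterable[Tuple[str, bool]],
    classify: Callable[[str], bool],
    register: Callable[[str, bool], None],
) -> int:
    """Register only the messages the classifier currently gets wrong.

    messages: (text, is_spam) pairs with known-correct labels.
    classify: hypothetical callable returning True if text is judged spam.
    register: hypothetical callable that trains the filter on one message.
    Returns the number of messages that had to be registered.
    """
    corrected = 0
    for text, is_spam in messages:
        # Classify with the database as it stands right now, so earlier
        # corrections in this run already influence later verdicts.
        if classify(text) != is_spam:
            register(text, is_spam)
            corrected += 1
    return corrected
```

Because correctly classified messages are never registered, one large batch of a single class cannot swamp the database the way the 15000-spam batch did.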
Jochem Huhmann wrote:
> I've been using Bogofilter for a few years now and I'm quite happy with it.
> I receive lots of spam, had only two or three false positives and not
> too many false negatives.
> But right now, after batch training Bogofilter with about 15000 spams
> filtered by other means, I've observed a strange thing: *All* mails
> get a bogosity of either 0.000000 (very rare, always ham mails),
> 0.500000 (nearly all good mails and all false negatives) or 1.000000
> (all spam mails). This was certainly not the case before the last
> training batch, I had good mails always at or very near 0, spam at or
> near 1 and false negatives somewhere in between. As it should be.
> Baffling. I decided to have a look at some mails with "bogofilter -vvv":
> It looks as if headers identically found in all mails (spam and
> ham) drag the bogosity heavily towards "spammy", so that even
> otherwise clearly good mails go to 0.500000. These headers are
> inserted by the two last SMTP servers on the way in, so they are
> present in all mails. This in turn led me to look at the message counts:
> $ bogoutil -p .bogofilter/wordlist.db .MSG_COUNT
>                  spam    good    Fisher
> .MSG_COUNT     126749    5761  0.500000
> Eek! This explains it somehow, there is more than twenty times more
> spam than ham and so identical tokens found in all mails will drag
> the bogosity up. But wait: I have received *many* more good mails
> than just 5761! I've about 25000 mails right now lying around (and I
> don't keep everything).
> Hmm. Then I remembered that I had set thresh_update=0.01 in
> ~/.bogofilter.cf, which led to clearly good or spammy mails not being
> registered at all, and since good mails were nearly always at 0, they
> didn't register anymore. Together with me cleaning up the database
> once a year from old tokens not used for a while *and* the recent
> batch training with 15000 spams I now have a database full of spam
> tokens and quite void of ham tokens...
> What to do now? Just wait for Bogofilter to catch up on tokens from
> good mail (since they are far away from 0 now they will be registered
> again)? Toss away the database and completely retrain with all good
> mail and an equal amount of spam (is 1:1 a good idea anyway?)?
> Manually remove all the common header tokens from the database to
> make the actually meaningful tokens stand out more?
> I have to say that while good mails are now at 0.500000 there's still
> a good distance to the spam cutoff of 0.99, so I don't really fear
> false positives. But seeing false negatives (which are clearly
> spammy to the eye) *and* perfectly good mail registering exactly
> the same bogosity makes me somewhat uneasy.
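To put numbers on the two effects described in the quoted message, here is a small sketch. It assumes thresh_update behaves as Jochem describes, i.e. a message whose score lies within thresh_update of 0 or 1 is not registered; the exact boundary handling is my guess, and the function names are made up for illustration.

```python
def spam_to_ham_ratio(spam_msgs: int, good_msgs: int) -> float:
    """Ratio of registered spam messages to registered ham messages."""
    return spam_msgs / good_msgs


def is_registered(score: float, thresh_update: float) -> bool:
    """Assumed rule: skip registration when the score is within
    thresh_update of either extreme (0.0 or 1.0)."""
    return thresh_update < score < 1.0 - thresh_update


# Message counts taken from the .MSG_COUNT line quoted above.
ratio = spam_to_ham_ratio(126749, 5761)
print(f"spam:ham is about {ratio:.1f}:1")

# With thresh_update=0.01, mail scoring 0.000000 or 1.000000 is skipped,
# while mail now sitting at 0.500000 would be registered again.
for score in (0.0, 0.5, 1.0):
    print(score, is_registered(score, 0.01))
```

Under that assumption, good mail that always scored at or near 0 stopped adding to the ham counts, which fits the 5761 figure being far below the number of good mails actually received.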