Nearly everything is either 0.500000 or 1.000000
Jochem Huhmann
joh at gmx.net
Wed Aug 15 14:48:33 CEST 2007
Hi,
I've been using Bogofilter for a few years now and I'm quite happy
with it. I receive lots of spam and have had only two or three false
positives and not too many false negatives.
But right now, after batch training Bogofilter with about 15000 spams
filtered by other means, I've observed a strange thing: *All* mails
get a bogosity of either 0.000000 (very rare, always ham mails),
0.500000 (nearly all good mails and all false negatives) or 1.000000
(all spam mails). This was certainly not the case before the last
training batch: good mails were always at or very near 0, spam at or
near 1 and false negatives somewhere in between. As it should be.
Baffling. I decided to have a look at some mails with "bogofilter
-vvv": it looks as if identical headers found in all mails (spam and
ham) drag the bogosity heavily towards "spammy", so that even
otherwise clearly good mails go to 0.500000. These headers are
inserted by the last two SMTP servers on the way in, so they are
present in all mails. This in turn led me to look at the message counts:
$ bogoutil -p .bogofilter/wordlist.db .MSG_COUNT
               spam   good    Fisher
.MSG_COUNT   126749   5761  0.500000
Eek! This explains it somewhat: there is more than twenty times as
much spam as ham, so identical tokens found in all mails will drag
the bogosity up. But wait: I have received *many* more good mails
than just 5761! I have about 25000 mails lying around right now (and
I don't keep everything).
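As a quick check on the imbalance, the ratio can be worked out from the two counts in the bogoutil output above. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```python
# Message counts copied from the .MSG_COUNT line above.
spam_msgs = 126749
good_msgs = 5761

# Registered spam messages per registered ham message.
ratio = spam_msgs / good_msgs
print(f"spam:ham ratio is about {ratio:.1f}:1")

# Share of all registered messages that are ham.
ham_share = good_msgs / (spam_msgs + good_msgs)
print(f"ham share of registered messages: {ham_share:.1%}")
```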
Hmm. Then I remembered that I had set thresh_update=0.01 in
~/.bogofilter.cf, which led to clearly good or clearly spammy mails
not being registered at all. Since good mails were nearly always at
0, they didn't register anymore. Together with my yearly cleanup of
old, long-unused tokens from the database *and* the recent batch
training with 15000 spams, I now have a database full of spam tokens
and quite void of ham tokens...
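To illustrate the effect, here is a minimal sketch of the skipping rule as described above: a mail is only registered if its score is further than thresh_update away from both 0 and 1. The function name and the sample scores are hypothetical, the exact boundary handling is an assumption, and this is not Bogofilter's actual code.

```python
THRESH_UPDATE = 0.01  # value set in ~/.bogofilter.cf, per the text above


def would_register(score: float, thresh_update: float = THRESH_UPDATE) -> bool:
    """Return True if a mail with this score would be added to the wordlist.

    Assumption: scores within thresh_update of 0.0 or 1.0 are skipped.
    """
    return thresh_update < score < 1.0 - thresh_update


# Hypothetical scores: clear ham, ham that now lands at 0.5, clear spam.
for score in (0.000000, 0.500000, 1.000000):
    print(f"{score:.6f} -> registered: {would_register(score)}")
```

Under this assumption, ham that used to score at 0 was silently skipped, while ham that now scores 0.500000 would be registered again.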
What to do now? Just wait for Bogofilter to catch up on tokens from
good mail (since they are far away from 0 now they will be registered
again)? Toss away the database and completely retrain with all good
mail and an equal amount of spam (is 1:1 a good idea anyway?)?
Manually remove all the common header tokens from the database to
make the actually meaningful tokens stand out more?
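For the retraining option, picking an equal amount of spam could look roughly like this. It is only a sketch: the function name and the file names are hypothetical, and whether 1:1 is the right ratio is exactly the open question.

```python
import random


def balanced_sample(ham, spam, seed=0):
    """Return all ham plus a random subset of spam no larger than the ham set."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    n = min(len(ham), len(spam))
    return list(ham), rng.sample(list(spam), n)


# Hypothetical message file names standing in for real mail.
ham = [f"ham/{i}.eml" for i in range(5)]
spam = [f"spam/{i}.eml" for i in range(100)]
ham_set, spam_set = balanced_sample(ham, spam)
print(len(ham_set), len(spam_set))
```

The two resulting lists would then be fed to whatever training step is used for ham and spam respectively.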
I have to say that while good mails are now at 0.500000 there's still
a good distance to the spam cutoff of 0.99, so I don't really fear
false positives. But seeing false negatives (which are clearly
spammy to the eye) *and* perfectly good mail register exactly
the same bogosity makes me somewhat uneasy.
Jochem
--
When the revolution comes, I will be shot by both sides.