troublesome false negative

Greg Louis glouis at dynamicro.on.ca
Mon Nov 4 01:44:11 CET 2002


On 20021103 (Sun) at 1910:22 -0500, David Relson wrote:
> 
> One of those "obviously spam" messages arrived and [the] Robinson
> [method] gave it a 0.497731 (ham) rating. I'm wondering what we can
> do to bogofilter so that it'll catch messages like this.

Train it, and perhaps tune it.

No, that's not intended as a flip or in any way dismissive answer.  I
put bogofilter to work for me about six weeks ago with Robinson's
method of calculation.  At the same time, precisely because I wanted to
see what it could do and what training it needed, I disabled my
antispam procmail rules.  At that time, I had a Graham-built training
set of about 4500 nonspams and 1100 spams.  At first there was a
disheartening flood of false negatives -- probably in the region of
12-15 percent.  I fed every one of those into bogofilter -s, and the
few false positives into bogofilter -n.  I fiddled and futzed with the
s and x parameters and with SPAM_CUTOFF.  And the error rate came down,
and down, and down...
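
For anyone wondering what the s and x parameters and SPAM_CUTOFF
actually do, here is a minimal Python sketch of Robinson's calculation
as I understand it from his description.  It is not bogofilter's code:
the function names, the default values of s and x, and the cutoff value
are placeholders chosen for illustration only.

```python
import math

# Placeholder value for illustration; not taken from bogofilter's source.
SPAM_CUTOFF = 0.54


def robinson_f(p_w, n, s=1.0, x=0.5):
    """Shrink a token's raw spam probability p_w toward the prior x.

    n is how many times the token has been seen in training, s is the
    weight given to the prior, and x is the probability assumed for a
    token that has never been seen.  The defaults are assumptions.
    """
    return (s * x + n * p_w) / (s + n)


def robinson_score(fs):
    """Combine per-token f(w) values using geometric means."""
    k = len(fs)
    if k == 0:
        return 0.5  # no evidence either way
    # Sum logs instead of multiplying, to avoid underflow on long messages.
    log_prod_f = sum(math.log(f) for f in fs)
    log_prod_not_f = sum(math.log(1.0 - f) for f in fs)
    p = 1.0 - math.exp(log_prod_not_f / k)  # "spamminess"
    q = 1.0 - math.exp(log_prod_f / k)      # "hamminess"
    s_ind = (p - q) / (p + q)               # indicator in [-1, 1]
    return (1.0 + s_ind) / 2.0              # rescaled to [0, 1]


def is_spam(score, cutoff=SPAM_CUTOFF):
    """Assumed decision rule: spam when the score reaches the cutoff."""
    return score >= cutoff
```

With any cutoff above 0.5, a message scoring 0.497731 comes out as
nonspam, which is the false negative David saw.  Raising s pulls rarely
seen tokens harder toward x, and moving the cutoff trades false
negatives against false positives, so the three are best tuned together.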

Now I have about 5500 nonspams and 5600 spams in the training set, and
I'm getting less than 4% false negatives in production (new emails, not
ones bogofilter's already seen), and a false positive rate in the 0.05%
range.  Will it get better than that?  Maybe not; or maybe it will, since
six weeks is not a long time; or perhaps there'll be a quantum leap with
Gary Robinson's new chi-squared method (which I'll be testing soon).  Will
it ever be zero false negatives and zero false positives?  I rather
doubt it.  New formulations keep popping up as the spammers try to
outwit programs like SpamAssassin.  Whether it be by supervised
training or by developing new filtering rules, a spam detection program
has to be kept current, and I suspect it will always be in catch-up
mode.

Once we have enough experience with bogofilter that we know its
specific weaknesses, a few judiciously crafted procmail rules may be
helpful in catching those corner cases.  For now, though, I'd think it
best to see what needs to be done to optimize the Robinson method
without any external hacks.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



