troublesome false negative
David Relson
relson at osagesoftware.com
Mon Nov 4 02:10:16 CET 2002
At 07:44 PM 11/3/02, Greg Louis wrote:
>On 20021103 (Sun) at 1910:22 -0500, David Relson wrote:
> >
> > One of those "obviously spam" messages arrived and [the] Robinson
> > [method] gave it a 0.497731 (ham) rating. I'm wondering what we can
> > do to bogofilter so that it'll catch messages like this.
>
>Train it, and perhaps tune it.
>
>No, that's not intended as a flip or in any way dismissive answer. I
>put bogofilter to work for me about six weeks ago with Robinson's
>method of calculation. At the same time, precisely because I wanted to
>see what it could do and what training it needed, I disabled my
>antispam procmail rules. At that time, I had a Graham-built training
>set of about 4500 nonspams and 1100 spams. At first there was a
>disheartening flood of false negatives -- probably in the region of
>12-15 percent. I fed every one of those into bogofilter -s, and the
>few false positives into bogofilter -n. I fiddled and futzed with the
>s and x parameters and with SPAM_CUTOFF. And the error rate came down,
>and down, and down...
One difference is in our word lists. My spamlist contains many fewer words
and messages than my good list. Currently I have a spamlist built from
6500 messages (yielding 112,000 words) and a goodlist built from 32,000
messages (303,000 words). The initial training was 3600 spam messages
(55,000 words) and 26,500 good messages (222,000 words). The updates
represent 4 weeks of incoming messages - all 8000 of them - using
auto-update ('-u' flag) and correcting via '-N' or '-S' for _every_
classification mistake. After 4 weeks of usage, I should be well up on the
training curve.
As related just a few minutes ago in my message to Tom Spollen, I think
that Graham and Robinson have different sensitivities to spam/ham word mix
within a message. Graham clearly has a dependency on the order in which
the words occur in the message, though this can be countered. My
guesstimate is that Robinson has a dependency on the proportion of spam to
good words. I'm looking for a good way to test this hypothesis. Any thoughts?
>Now I have about 5500 nonspams and 5600 spams in the training set, and
>I'm getting less than 4% false negatives in production (new emails, not
>ones bogofilter's already seen), and a false positive rate in the 0.05%
>range. Will it get better than that? Maybe not, or maybe -- six weeks
>is not a long time -- or perhaps there'll be a quantum leap with Gary
>Robinson's new chi-squared method (which I'll be testing soon). Will
>it ever be zero false negatives and zero false positives? I rather
>doubt it. New formulations keep popping up as the spammers try to
>outwit programs like SpamAssassin. Whether it be by supervised
>training or by developing new filtering rules, a spam detection program
>has to be kept current, and I suspect it will always be in catch-up
>mode.
>
>Once we have enough experience with bogofilter that we know its
>specific weaknesses, a few judiciously-crafted procmail rules may be
>helpful in catching those corner cases. For now, though, I'd think it
>best to see what needs to be done to optimize the Robinson method
>without any external hacks.
I'd like to avoid judiciously-crafted procmail rules. That'd be moving
towards the need for an expert to craft spam identification rules - a field
in which SpamAssassin is king. The promise of the Bayesian approach is
training on the specific mix of messages at the site running bogofilter.
By the way, I'll gladly send you a copy of the troublesome message to see
if you get a different result.
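And for when the chi-squared testing starts: here is my understanding of Gary's new combining rule, sketched in Python (chi2Q is the survival function of a chi-squared distribution with an even number of degrees of freedom; the per-word probabilities are made up):

```python
import math

def chi2Q(x2, df):
    """P(chi-squared variate with df degrees of freedom > x2), df even."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_squared_score(fs):
    """Fisher-style combining: under the null hypothesis,
    -2 * sum(ln f) is chi-squared with 2n degrees of freedom.
    S and H measure the evidence for spam and for ham; their
    difference gives the final score in [0, 1]."""
    n = len(fs)
    S = chi2Q(-2.0 * sum(math.log(f) for f in fs), 2 * n)
    H = chi2Q(-2.0 * sum(math.log(1.0 - f) for f in fs), 2 * n)
    return (1.0 + S - H) / 2.0

print(chi_squared_score([0.99] * 10))  # strongly spammy words
print(chi_squared_score([0.50] * 10))  # uninformative words
```

The appealing property, if I have it right, is the middle ground: when the words give no strong evidence either way, S and H cancel and the score sits near 0.5 instead of drifting to one side.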
David