troublesome false negative

Mon Nov 4 03:27:34 CET 2002

On 20021103 (Sun) at 2010:16 -0500, David Relson wrote:
> At 07:44 PM 11/3/02, Greg Louis wrote:
> 
> >On 20021103 (Sun) at 1910:22 -0500, David Relson wrote:
> >>
> >> One of those "obviously spam" messages arrived and [the] Robinson
> >> [method] gave it a 0.497731 (ham) rating. I'm wondering what we can
> >> do to bogofilter so that it'll catch messages like this.
> >
> >Train it, and perhaps tune it.
> >

> One difference is in our word lists.  My spamlist contains many fewer words 
> and messages than my good list.  Currently I have a spamlist built from 
> 6500 messages (yielding 112,000 words) and a goodlist built from 32,000 
> messages (303,000 words).  The initial training was 3600 spam messages 
> (55,000 words) and 26,500 good messages (222,000 words).  The updates 
> represent 4 weeks of incoming messages - all 8000 of them - using 
> auto-update ('-u' flag) and correcting via '-N' or '-S' for _every_ 
> classification mistake.  After 4 weeks of usage, I should be well up on the 
> training curve.

English has about 1.1 million words, so there's room for expansion ;-)

> As related just a few minutes ago in my message to Tom Spollen, I think 
> that Graham and Robinson have different sensitivities to spam/ham word mix 
> within a message.  Graham clearly has a dependency on the order in which 
> the words occur in the message, though this can be countered.  My 
> guesstimate is that Robinson has a dependency on the proportion of spam to 
> good words.  I'm looking for a good way to test this hypothesis.  Any 
> thoughts?

My first thought is that the whole approach is dependent on the
proportion of spam to good words.  All we do (if min_dev is zero) is
assign a weight to each word in the message, and calculate spamicity
from the geometric means (taken via logarithms) of the weights and of
their inverse (1-weight) values.  So the whole ballgame centers around
sensitivity to the spam/good word mix within a message.
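
To make the calculation described above concrete, here is a minimal
Python sketch.  It assumes the usual statement of the Robinson
combination: P = 1 - geometric mean of (1-weight), Q = 1 - geometric
mean of weight, S = (P-Q)/(P+Q), spamicity = (1+S)/2.  The function
name, the eps clamp and the sample weights are illustrative
assumptions, not taken from bogofilter's source.

```python
import math

def robinson_spamicity(weights, min_dev=0.0, eps=1e-9):
    """Combine per-word weights (0 = hammy, 1 = spammy) into one score."""
    # Ignore words whose weight lies within min_dev of the neutral 0.5.
    used = [w for w in weights if abs(w - 0.5) >= min_dev]
    if not used:
        return 0.5  # nothing informative left: stay neutral
    # Clamp away from 0 and 1 so the logarithms below are defined
    # (the clamp is an assumption of this sketch).
    used = [min(max(w, eps), 1.0 - eps) for w in used]
    n = len(used)
    # Geometric means, computed as exp(mean of logs).
    gm_weight = math.exp(sum(math.log(w) for w in used) / n)
    gm_inverse = math.exp(sum(math.log(1.0 - w) for w in used) / n)
    p = 1.0 - gm_inverse   # grows as the words look spammier
    q = 1.0 - gm_weight    # grows as the words look hammier
    s = (p - q) / (p + q)  # ranges over [-1, 1]
    return (1.0 + s) / 2.0

print(robinson_spamicity([0.9, 0.9]))                      # uniformly spammy words
print(robinson_spamicity([0.9, 0.1]))                      # one spammy, one hammy word
print(robinson_spamicity([0.95, 0.55, 0.5], min_dev=0.4))  # weak words ignored
```

In this sketch a message whose strongly spammy and strongly hammy
words balance out lands very close to 0.5, which is one way a message
could end up with a score near the cutoff.  Raising min_dev simply
drops the near-neutral words before the means are taken.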

Incidentally, nothing Gary ever wrote says you have to keep min_dev
zero.  On the contrary, you could set it to 0.4 to increase throughput,
and (if my early experiments are worth anything) it would degrade the
performance some, but would still work fairly well.  I started working
with the Robinson method with min_dev at 0.4, but I set it to zero
around the time I did the port to version 0.7.4, because at that time
it worked best for me that way.

Training-list size isn't so crucial, I don't think, once you have
several thousand messages in the smaller of the two.  The calculation
is tolerant of moderate disproportion between the sizes of goodlist and
spamlist; it's to produce that tolerance that we scale the counts to
the message-count of the spamlist.  You would get into trouble if there
were gross differences in the average sizes of messages fed to the
goodlist and to the spamlist, of course; if your only nonspams are
occasional electronic books in plaintext, bogofilter is eventually
going to start thinking no message is spam, just because the word
counts in the goodlist are like those in the spamlist but the goodlist
message-count is so low that they're getting scaled up to the sky :)
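
A small Python sketch of the scaling just described may help.  It
assumes the per-word weight is the spam count divided by the spam
count plus the good count rescaled to the spamlist's message count;
the function name and the numbers are hypothetical, and any further
smoothing of the weight that bogofilter may apply is left out.

```python
def word_weight(spam_count, good_count, spam_msgs, good_msgs):
    """Raw per-word weight with good counts scaled to the spamlist size."""
    # Rescale the goodlist count as if the goodlist held spam_msgs messages.
    scaled_good = good_count * (spam_msgs / good_msgs)
    total = spam_count + scaled_good
    if total == 0:
        return 0.5  # word never seen: treat as neutral (assumption)
    return spam_count / total

# Lists of different sizes, same per-message frequency of the word.
print(word_weight(650, 3200, spam_msgs=6500, good_msgs=32000))

# A goodlist of only 10 very long messages: equal raw counts, but the
# good count is scaled up a hundredfold and the word looks like pure ham.
print(word_weight(500, 500, spam_msgs=1000, good_msgs=10))
```

The first call shows the tolerance for lists of different sizes: a
word that appears at the same rate per message in both lists comes out
neutral.  The second shows the e-book case, where a tiny goodlist of
huge messages pushes the weight of a common word close to zero.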

> I'd like to avoid judiciously-crafted procmail rules.  That'd be moving 
> towards the need for an expert to craft spam identification rules - a field 
> in which SpamAssassin is king.  The promise of the Bayesian approach is 
> training on the specific mix of messages at the site running bogofilter.

I've tried SpamAssassin, and I think there's a helluva big difference
between the amount of effort you'd have to put into rule design if you
were running that and the minimal effort of supplementation that
bogofilter may need.  But sure, if we can avoid supplements altogether,
it's by far the better solution.

> By the way, I'll gladly send you a copy of the troublesome message to see 
> if you get a different result.

Please do; might tell us something we haven't twigged to yet.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |