false positive

Barry Gould BarryGould at PennySaverUSA.net
Mon Jan 20 21:16:12 CET 2003


Hi,
I just found that this month's Crypto-Gram newsletter (which I subscribe 
to, have trained on, and which is normally recognized as ham) was in my 
spam bin.
Not only that, but it got a score of 0.99!

I'm not exactly sure what's going on, so I'm going to ramble about several 
things below.

I currently only auto-train on spam (after initially training both spam and 
ham with several thousand messages), and I suspect that hidden keywords in 
spam I've received may have helped cause this.
However, I only keep my own spam, not that of all users, so I can't confirm 
this.

I do see that the word 'geopolitical' did appear in one of my spams (a 
political spam), and now in Crypto-Gram.
Most of the other words must have come from spam sent to other users. I 
don't know whether they were hidden keywords or part of the message.

Anyway, I'm not really sure whether we can do anything about this, but I 
do have a few comments:

Bogofilter is version 0.8.0, using the standard algorithm and no config file.

Out of the 15 selected words, there were 7 that were hammy, and 8 that were 
spammy.
It seems odd to me that this would result in a score of 0.99, but I haven't 
done the math.
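
For what it's worth, the math can be checked with a short sketch. It 
assumes the standard algorithm combines the selected word probabilities the 
way Paul Graham's "A Plan for Spam" does; the function name is made up for 
illustration and is not taken from the bogofilter source.

```python
from math import prod

def combine_graham(probs):
    """Combine per-word spam probabilities, Graham style.

    Assumption: this mirrors the formula from "A Plan for Spam",
    not necessarily bogofilter's actual code.
    """
    spam_product = prod(probs)
    ham_product = prod(1.0 - p for p in probs)
    return spam_product / (spam_product + ham_product)

# 7 hammy words at 0.01 and 8 spammy words at 0.99, as in the -vv output.
selected = [0.01] * 7 + [0.99] * 8
score = combine_graham(selected)
print(f"{score:.6f}")
```

Under that assumption, seven hammy factors cancel against seven spammy ones, 
so the single leftover 0.99 decides the result: 0.99 / (0.99 + 0.01) = 0.99.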

According to bogofilter -vv, all hammy words seem to have the same prob 
(0.010), and all spammy words are 0.990.
However, bogoutil reports otherwise. Am I misreading bogofilter -vv?
(outputs are listed below)

What is up here:
# bogoutil -p -w .
                    spam    good    prob
crypto-gram            4      12    0.559566
??
How can 12 good / 4 bad be considered spam???
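
One reading that reproduces this number: the raw counts are divided by the 
total number of spam and good messages (the .MSG_COUNT values, 8237 and 
31395) before being compared, so 4 hits in a small spam corpus can outweigh 
12 hits in a much larger good corpus. The sketch below is inferred from the 
figures in this message, not from the bogoutil source, and the function name 
is hypothetical.

```python
def normalized_word_prob(spam_count, good_count, spam_msgs, good_msgs):
    """Spam probability of a word from per-message frequencies.

    Assumption: counts are scaled by corpus size and nothing else;
    this matches the numbers quoted here but is not bogoutil's code.
    """
    spam_freq = spam_count / spam_msgs
    good_freq = good_count / good_msgs
    return spam_freq / (spam_freq + good_freq)

SPAM_MSGS, GOOD_MSGS = 8237, 31395  # from .MSG_COUNT

print(f"{normalized_word_prob(4, 12, SPAM_MSGS, GOOD_MSGS):.6f}")  # crypto-gram
print(f"{normalized_word_prob(1, 31, SPAM_MSGS, GOOD_MSGS):.6f}")  # crypto
```

With roughly 3.8 times as many good messages as spam messages, 4 in 8237 is 
a slightly higher rate than 12 in 31395, which puts the word just over 0.5.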

I note that ALL of the spammy words output by bogofilter -vv are only 
registered a few times in the dbs (most of them appear 6 or 7 times in 
spamlist.db, and 0 in goodlist.db).

There seems to be a lot of rounding going on here.
midas appears 15-0 and hostilities 6-0, but both seem to get the same 
score (0.990000) according to bogofilter -vv.
Also note the difference between words such as 'www.infoworld.com' and 
'perl': very different (factor of 5), but both get rounded to 0.010?
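
A possible explanation, sketched under the assumption that bogofilter -vv 
follows Graham's per-word formula: the good count is doubled, and the result 
is then clamped to the range 0.01 to 0.99. That would be clamping rather 
than rounding, and it would also account for bogoutil -p showing different 
values. The function name and the 0.4 fallback for unseen words come from 
Graham's essay and are assumptions here, not bogofilter's code.

```python
def graham_word_prob(spam_count, good_count, spam_msgs, good_msgs):
    """Per-word spam probability as in "A Plan for Spam" (assumed)."""
    bad = min(1.0, spam_count / spam_msgs)
    good = min(1.0, 2 * good_count / good_msgs)  # good count is doubled
    if bad + good == 0:
        return 0.4  # Graham's value for a word never seen before
    raw = bad / (bad + good)
    return max(0.01, min(0.99, raw))  # clamp to [0.01, 0.99]

SPAM_MSGS, GOOD_MSGS = 8237, 31395  # from .MSG_COUNT

for word, spam, good in [
    ("perl", 9, 2213),
    ("www.infoworld.com", 1, 956),
    ("hostilities", 6, 0),
    ("midas", 15, 0),
]:
    print(f"{graham_word_prob(spam, good, SPAM_MSGS, GOOD_MSGS):.6f}  {word}")
```

Under these assumptions every hammy word in the list falls below 0.01 before 
clamping and every word with no good hits comes out at 1.0, so they all end 
up pinned at one of the two limits.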

Note that scores may not exactly match, as the Crypto-Gram was received on 
Jan 16th, and it is now the 20th.
I did however check the past 8 days of backups for the word 'barlow', and I 
have confirmed that no new mail with that name/word has been received.

BTW, this is the ONLY false positive I have seen since training many months 
ago, so overall, I am VERY pleased with bogofilter.
I've seen many false positives from spamassassin on my university account 
(though I admit training/whitelisting probably needs to be done).

Thanks,
Barry

# bogoutil -w . .MSG_COUNT
                        spam   good
.MSG_COUNT             8237  31395

# bogoutil -d goodlist.db |wc -l
  227160

# bogoutil -d spamlist.db |wc -l
   83511

# bogofilter -vv < crypto-gram
X-Bogosity: Yes, tests=bogofilter, spamicity=0.990000, version=0.8.0
         0.010000  barlow
         0.010000  packets
         0.010000  patches
         0.010000  perl
         0.010000  pub
         0.010000  stereo
         0.010000  www.infoworld.com
         0.990000  crafting
         0.990000  differentiation
         0.990000  fatally
         0.990000  geopolitical
         0.990000  guaranteeing
         0.990000  hostilities
         0.990000  impede
         0.990000  midas

# bogoutil -p -w .
                    spam    good    prob
crypto                 1      31    0.109489
crypto-gram            4      12    0.559566
barlow                 1     312    0.012069
packets                2     517    0.014530
patches                1     247    0.015197
perl                   9    2213    0.015264
pub                   21   12059    0.006594
stereo                 1     568    0.006666
www.infoworld.com      1     956    0.003971
crafting               7       0    1.000000
differentiation        7       0    1.000000
fatally                7       0    1.000000
geopolitical           7       0    1.000000
guaranteeing           7       0    1.000000
hostilities            6       0    1.000000
impede                 8       0    1.000000
midas                 15       0    1.000000




