false positive
Barry Gould
BarryGould at PennySaverUSA.net
Mon Jan 20 21:16:12 CET 2003
Hi,
I just found this month's Crypto-Gram newsletter in my spambin. I subscribe
to it, have trained on it, and it is normally recognized as ham.
Not only that, but it got a score of 0.99!
I'm not exactly sure what's going on, so I'm going to ramble about several
things below.
I currently only auto-train on spam (after initially training both spam and
ham with several thousand messages), and I suspect that hidden keywords in
spam I've received may have helped cause this.
However, I only keep my own spam, not that of all users, so I can't confirm
this.
I do see that the word 'geopolitical' did appear in one of my spams (a
political spam), and now in Crypto-Gram.
Most of the other words must have come from spam sent to other users. I don't
know whether they were hidden keywords or part of the visible message.
Anyway, I'm not really sure whether we can do anything about this, but I
do have a few comments:
Bogofilter is 0.8.0, using standard algorithm and no config file.
Out of the 15 selected words, there were 7 that were hammy, and 8 that were
spammy.
It seems odd to me that this would result in a score of 0.99, but I haven't
done the math.
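For what it's worth, the math can be checked quickly if the "standard algorithm" is the combining rule from Paul Graham's "A Plan for Spam". That is an assumption here, not something confirmed in this message, and the function name `combine` is only illustrative. Under that rule the seven 0.01 words cancel seven of the 0.99 words, and the one leftover spammy word decides the result:

```python
from math import prod

def combine(probs):
    """Graham-style combination: prod(p) / (prod(p) + prod(1 - p))."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# 7 hammy words at 0.01 and 8 spammy words at 0.99, as in the -vv output
probs = [0.01] * 7 + [0.99] * 8
score = combine(probs)
print(f"{score:.6f}")  # prints 0.990000
```

So with every word pinned at either 0.01 or 0.99, an 8-to-7 split is enough to give 0.99.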
According to bogofilter -vv, all hammy words seem to have the same prob
(0.010), and all spammy words have 0.990.
However, bogoutil reports otherwise. Am I misreading bogofilter -vv?
(outputs are listed below)
What is up here:
# bogoutil -p -w .
                     spam   good      prob
crypto-gram             4     12  0.559566
??
How can 12 good / 4 bad be considered spam???
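One possible reading, inferred only from the numbers in this message and not from the bogofilter source: bogoutil -p seems to compare per-message rates rather than raw counts. With 8237 spam messages and 31395 good messages in .MSG_COUNT, 4 hits in spam is a higher rate than 12 hits in ham. The helper name `word_prob` is made up for illustration:

```python
def word_prob(spam_count, good_count, spam_msgs, good_msgs):
    """Spam probability from per-message rates (assumed formula)."""
    spam_rate = spam_count / spam_msgs
    good_rate = good_count / good_msgs
    return spam_rate / (spam_rate + good_rate)

# Message totals taken from the .MSG_COUNT line in this post
SPAM_MSGS, GOOD_MSGS = 8237, 31395

p_cryptogram = word_prob(4, 12, SPAM_MSGS, GOOD_MSGS)
p_barlow = word_prob(1, 312, SPAM_MSGS, GOOD_MSGS)
print(f"crypto-gram {p_cryptogram:.6f}")
print(f"barlow      {p_barlow:.6f}")
```

Both values agree with the bogoutil figures quoted in this message (0.559566 and 0.012069) to within rounding, which supports the guess but does not prove it.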
I note that ALL of the words output by bogofilter -vv are only registered a
few times in the dbs (most of the words seem to appear 6 or 7 times in
spam.db, and 0 in good.db).
There seems to be a lot of rounding going on here.
'midas' appears 15-0 and 'hostilities' 6-0, but both seem to get the
same score (0.990000) according to bogofilter -vv.
Also note the differences between words such as 'www.infoworld.com' and 'perl':
very different (factor of 5), but both get rounded to 0.010?
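The identical values look more like clamping than rounding. Below is a sketch assuming the per-word formula from Paul Graham's "A Plan for Spam" (good counts doubled, result clamped to the range 0.01 to 0.99). Whether bogofilter 0.8.0 does exactly this is an assumption, and `graham_prob` is a hypothetical name:

```python
def graham_prob(spam_count, good_count, spam_msgs, good_msgs,
                good_bias=2.0, lo=0.01, hi=0.99):
    """Per-word probability as in Graham's essay (assumed, not from source).

    Does not handle words with zero counts in both lists.
    """
    b = min(1.0, spam_count / spam_msgs)
    g = min(1.0, good_bias * good_count / good_msgs)
    return max(lo, min(hi, b / (g + b)))

SPAM_MSGS, GOOD_MSGS = 8237, 31395  # from .MSG_COUNT in this post

# (spam count, good count) pairs copied from the bogoutil output
counts = {
    "perl": (9, 2213),
    "www.infoworld.com": (1, 956),
    "midas": (15, 0),
    "hostilities": (6, 0),
}
clamped = {w: graham_prob(s, g, SPAM_MSGS, GOOD_MSGS)
           for w, (s, g) in counts.items()}
for word, p in clamped.items():
    print(f"{p:.6f} {word}")
```

Under these assumptions every hammy word here falls below 0.01 once the good count is doubled, and every spam-only word hits 1.0, so both groups are pinned to the limits. That would match the -vv output.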
Note that scores may not exactly match, as the Crypto-Gram was received on
Jan 16th, and it is now the 20th.
I did however check the past 8 days of backups for the word 'barlow', and I
have confirmed that no new mail with that name/word has been received.
BTW, this is the ONLY false positive I have seen since training many months
ago, so overall, I am VERY pleased with bogofilter.
I've seen many false positives from spamassassin on my university account
(though I admit training/whitelisting probably needs to be done).
Thanks,
Barry
# bogoutil -w . .MSG_COUNT
                     spam   good
.MSG_COUNT           8237  31395
# bogoutil -d goodlist.db |wc -l
227160
# bogoutil -d spamlist.db |wc -l
83511
# bogofilter -vv < crypto-gram
X-Bogosity: Yes, tests=bogofilter, spamicity=0.990000, version=0.8.0
0.010000 barlow
0.010000 packets
0.010000 patches
0.010000 perl
0.010000 pub
0.010000 stereo
0.010000 www.infoworld.com
0.990000 crafting
0.990000 differentiation
0.990000 fatally
0.990000 geopolitical
0.990000 guaranteeing
0.990000 hostilities
0.990000 impede
0.990000 midas
# bogoutil -p -w .
                     spam   good      prob
crypto                  1     31  0.109489
crypto-gram             4     12  0.559566
barlow                  1    312  0.012069
packets                 2    517  0.014530
patches                 1    247  0.015197
perl                    9   2213  0.015264
pub                    21  12059  0.006594
stereo                  1    568  0.006666
www.infoworld.com       1    956  0.003971
crafting                7      0  1.000000
differentiation         7      0  1.000000
fatally                 7      0  1.000000
geopolitical            7      0  1.000000
guaranteeing            7      0  1.000000
hostilities             6      0  1.000000
impede                  8      0  1.000000
midas                  15      0  1.000000