troublesome false negative
David Relson
relson at osagesoftware.com
Mon Nov 4 01:10:22 CET 2002
Greetings,
The Graham and Robinson algorithms are clearly two different ways to
calculate spamicity. I wish they were equally good on all messages, but
there seems to be a class of message where one says "spam" and the other
says "ham", even though to a human, the message is clearly and obviously spam.
One of those "obviously spam" messages arrived and Robinson gave it a
0.497731 (ham) rating. I'm wondering what we can do to bogofilter so that
it'll catch messages like this. The message's subject was "Joke of the Day
Nov 2" and the actual subject matter was a bit of joke/story plus a lot of
"lose weight, increase the size of your ..." kind of trash. Graham gave
the message a 0.990000 (spam) rating.
My goal in writing about this troublesome message is to find some ideas for
dealing effectively with it.
David
Below is some information on the message (which is 274 lines, 889 words,
16,457 chars). If anyone wants the actual message, I'll be glad to send it
(in compressed form to bypass the spam check).
The Robinson histogram is:
[root@nic spam-fixups]# bogofilter -r -v -v < spam.1103.1838.txt
X-Bogosity: No, tests=bogofilter, spamicity=0.497731, version=0.8.0-1102.1447
# int cnt prob spamicity histogram
# 0.00 22 0.010732 0.002003 ##################
# 0.10 15 0.167809 0.021700 ############
# 0.20 64 0.222184 0.094994 ##################################################
# 0.30 36 0.348368 0.146837 #############################
# 0.40 24 0.441698 0.186142 ###################
# 0.50 23 0.556350 0.231152 ##################
# 0.60 12 0.651680 0.257642 ##########
# 0.70 14 0.745020 0.292841 ###########
# 0.80 23 0.855185 0.356823 ##################
# 0.90 41 0.962410 0.497731 #################################
which shows a lot of high spamicity words (41 between 0.90 and 1.00) as
well as a lot of low spamicity words.
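As a rough illustration of why a message with many words at both extremes can land near 0.5, here is a minimal sketch of Robinson's geometric-mean combination as it is commonly described. The function name and the sample probabilities are assumptions made for illustration, and the bogofilter 0.8.0 code may differ in detail.

```python
from math import exp, log

def robinson_combine(probs):
    """Combine per-word probabilities using geometric means (sketch)."""
    n = len(probs)
    # P is based on the geometric mean of (1 - f); Q on the geometric mean of f.
    p = 1.0 - exp(sum(log(1.0 - f) for f in probs) / n)
    q = 1.0 - exp(sum(log(f) for f in probs) / n)
    s = (p - q) / (p + q)      # ranges from -1 (ham) to +1 (spam)
    return (1.0 + s) / 2.0     # rescaled to 0..1

# Hypothetical inputs, not taken from the message:
mixed = [0.05] * 20 + [0.95] * 20   # equal numbers of hammy and spammy words
spammy = [0.95] * 40                # spammy words only

print(round(robinson_combine(mixed), 6))
print(round(robinson_combine(spammy), 6))
```

In this sketch the hammy and spammy words cancel each other, so the mixed list scores about 0.5 even though half of its words are strongly spammy.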
Graham had no problem identifying it as spam:
[root@nic spam-fixups]# bogofilter -g -v -v < spam.1103.1838.txt
X-Bogosity: Yes, tests=bogofilter, spamicity=0.990000, version=0.8.0-1102.1447
# 0.010000 blonde
# 0.010000 consulted
# 0.010000 mankind
# 0.010000 praise
# 0.010000 spreading
# 0.010000 subscribe
# 0.010000 wildfire
# 0.990000 dhea
# 0.990000 ff6633
# 0.990000 m25
# 0.990000 stormpost
# 0.990000 t.pl
# 0.990000 x-list-unsubscribe
# 0.990000 x-stormpost-to
# 0.990000 x-x
In fact I ran "cat spam.1103.1838.txt | bogolexer -p | bogoutil -p -w
$BOGOFILTER_DIR" to look at word probabilities and found 12 words that were
only in the spam list.
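The 0.990000 result can be checked by hand from the fifteen words Graham printed: seven at 0.01 and eight at 0.99. A minimal sketch, assuming the standard Graham combining formula prod(p) / (prod(p) + prod(1 - p)); the function name is made up for illustration.

```python
from math import prod

def graham_combine(probs):
    """Combine word probabilities with the Graham formula (sketch)."""
    spam = prod(probs)                   # product of p
    ham = prod(1.0 - p for p in probs)   # product of (1 - p)
    return spam / (spam + ham)

# The fifteen extreme words from the -g output: 7 at 0.01, 8 at 0.99.
probs = [0.01] * 7 + [0.99] * 8
score = graham_combine(probs)
print(f"{score:.6f}")   # prints 0.990000
```

Seven of the 0.99 words cancel the seven 0.01 words, so the single remaining 0.99 word decides the score.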
Here are the 10 words used most often in the message, followed by the
Graham and Robinson probabilities for the 12 most frequent words:
[relson@osage spam-fixups]$ bogolexer -p < spam.1103.1838.txt | sort | uniq -c | sort -n | tail -10
10 products
14 arial
14 sans-serif
15 helvetica
16 blank
18 area
19 the
56 cgi-bin
56 t.pl
64 http
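For anyone who would rather do the counting step in a script, the sort | uniq -c | sort -n | tail pipeline above can be sketched in Python as follows. It assumes the tokens arrive one per line, as bogolexer -p prints them; the function name and the sample tokens are made up for illustration.

```python
from collections import Counter

def top_tokens(tokens, n=10):
    """Return the n most frequent tokens as (token, count) pairs."""
    counts = Counter(t.strip() for t in tokens if t.strip())
    return counts.most_common(n)   # most frequent first

# Hypothetical token stream, not the real message:
sample = ["http", "t.pl", "http", "the", "t.pl", "http"]
for token, count in top_tokens(sample, 2):
    print(count, token)
```

Unlike the shell pipeline, this lists the most frequent token first. To use it on a real message, pass sys.stdin as the tokens argument and pipe the bogolexer output into the script.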
[relson@osage spam-fixups]$ bogolexer -p < spam.1103.1838.txt | sort | uniq -c | sort -n | tail -12 | awk '{print $2}' | bogoutil -p -w /var/lib/bogofilter | sort +3n
word        spam   good  Gra prob  Rob prob
the         4333  78312  0.296205  0.296181
for         7025  72200  0.425323  0.425273
products     275   2807  0.427001  0.425731
http        6853  40550  0.562461  0.562353
area         386   1441  0.670788  0.667841
nbsp        3025   8758  0.724311  0.723857
cgi-bin      567   1473  0.745415  0.742829
arial       2907   3093  0.877287  0.876547
helvetica   2017   1751  0.897562  0.896439
sans-serif  1795    966  0.933925  0.932543
blank       1574    597  0.952505  0.950858
t.pl          76      0  1.000000  0.963589