troublesome false negative

David Relson relson at osagesoftware.com
Mon Nov 4 01:10:22 CET 2002


Greetings,

The Graham and Robinson algorithms are clearly two different ways to 
calculate spamicity.  I wish they were equally good on all messages, but 
there seems to be a class of message where one says "spam" and the other 
says "ham", even though to a human, the message is clearly and obviously spam.

One of those "obviously spam" messages arrived and Robinson gave it a 
0.497731 (ham) rating.  I'm wondering what we can do to bogofilter so that 
it'll catch messages like this.  The message's subject was "Joke of the Day 
Nov 2" and the actual subject matter was a bit of joke/story plus a lot of 
"lose weight, increase the size of your ..." kind of trash.  Graham gave 
the message a 0.990000 (spam) rating.

My goal in writing about this troublesome message is to find some ideas for 
dealing effectively with it.

David

Below is some information on the message (which is 274 lines, 889 words, 
16,457 chars).  If anyone wants the actual message, I'll be glad to send it 
(in compressed form to bypass the spam check).

The Robinson histogram is:

[root@nic spam-fixups]# bogofilter -r -v -v < spam.1103.1838.txt
X-Bogosity: No, tests=bogofilter, spamicity=0.497731, version=0.8.0-1102.1447
#      int  cnt    prob   spamicity  histogram
#     0.00   22  0.010732  0.002003  ##################
#     0.10   15  0.167809  0.021700  ############
#     0.20   64  0.222184  0.094994  ##################################################
#     0.30   36  0.348368  0.146837  #############################
#     0.40   24  0.441698  0.186142  ###################
#     0.50   23  0.556350  0.231152  ##################
#     0.60   12  0.651680  0.257642  ##########
#     0.70   14  0.745020  0.292841  ###########
#     0.80   23  0.855185  0.356823  ##################
#     0.90   41  0.962410  0.497731  #################################

which shows a lot of high spamicity words (41 between 0.90 and 1.00) as 
well as a lot of low spamicity words.
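
To see why so many high-spamicity words still leave the score near 0.5, here is a rough sketch in Python. It assumes the geometric-mean combination Gary Robinson proposed (P, Q and S below), and it stands in each interval's mean probability for every word in that interval, so it only approximates what bogofilter computes from the individual word probabilities. The function name is made up for illustration.

```python
import math

# (count, mean probability) per interval, copied from the histogram above.
BUCKETS = [
    (22, 0.010732), (15, 0.167809), (64, 0.222184), (36, 0.348368),
    (24, 0.441698), (23, 0.556350), (12, 0.651680), (14, 0.745020),
    (23, 0.855185), (41, 0.962410),
]

def robinson_from_buckets(buckets):
    """Approximate Robinson score, treating each word as its interval mean."""
    n = sum(count for count, _ in buckets)
    # Sums of logs avoid underflow when multiplying hundreds of probabilities.
    log_prod_p = sum(count * math.log(p) for count, p in buckets)
    log_prod_inv = sum(count * math.log(1.0 - p) for count, p in buckets)
    P = 1.0 - math.exp(log_prod_inv / n)   # pulled up by spammy words
    Q = 1.0 - math.exp(log_prod_p / n)     # pulled up by hammy words
    return (1.0 + (P - Q) / (P + Q)) / 2.0

approx = robinson_from_buckets(BUCKETS)
print(f"words={sum(c for c, _ in BUCKETS)} approx={approx:.3f}")
```

Under these assumptions the result lands slightly below 0.5 rather than exactly on 0.497731, since the interval means hide the spread of the individual words. The point it illustrates is that every word gets a vote, so the 22 words near 0.01 and the 64 words near 0.22 offset the 41 words above 0.90.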

Graham had no problem identifying it as spam:

[root@nic spam-fixups]# bogofilter -g -v -v < spam.1103.1838.txt
X-Bogosity: Yes, tests=bogofilter, spamicity=0.990000, version=0.8.0-1102.1447
#    0.010000  blonde
#    0.010000  consulted
#    0.010000  mankind
#    0.010000  praise
#    0.010000  spreading
#    0.010000  subscribe
#    0.010000  wildfire
#    0.990000  dhea
#    0.990000  ff6633
#    0.990000  m25
#    0.990000  stormpost
#    0.990000  t.pl
#    0.990000  x-list-unsubscribe
#    0.990000  x-stormpost-to
#    0.990000  x-x
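
The 0.990000 follows directly from the fifteen words listed, if they are combined with Paul Graham's naive Bayes formula from "A Plan for Spam". The sketch below assumes that formula; the function name is made up for illustration.

```python
import math

# The fifteen words bogofilter printed above: seven at 0.01, eight at 0.99.
probs = [0.01] * 7 + [0.99] * 8

def graham_combine(probs):
    """Combine per-word probabilities: prod(p) / (prod(p) + prod(1 - p))."""
    prod_spam = math.prod(probs)
    prod_ham = math.prod(1.0 - p for p in probs)
    return prod_spam / (prod_spam + prod_ham)

score = graham_combine(probs)
print(f"{score:.6f}")  # 0.990000
```

Seven of the 0.01 words cancel seven of the 0.99 words, and the one 0.99 word left over decides the result. Only the extreme words take part, so the rest of the message has no influence on the score.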

In fact I ran "cat spam.1103.1838.txt | bogolexer -p | bogoutil -p -w 
$BOGOFILTER_DIR" to look at word probabilities and found 12 words that were 
only in the spam list.

Here are the words used most often in the message (top 10 by count), followed 
by the top 12 with their Graham and Robinson probabilities:

[relson@osage spam-fixups]$ bogolexer -p < spam.1103.1838.txt | sort | uniq -c | sort -n | tail -10
      10	products
      14	arial
      14	sans-serif
      15	helvetica
      16	blank
      18	area
      19	the
      56	cgi-bin
      56	t.pl
      64	http

[relson@osage spam-fixups]$ bogolexer -p < spam.1103.1838.txt | sort | uniq -c | sort -n | tail -12 | awk '{print $2}' | bogoutil -p -w /var/lib/bogofilter | sort +3n
                        spam    good  Gra prob  Rob prob
the                    4333   78312  0.296205  0.296181
for                    7025   72200  0.425323  0.425273
products                275    2807  0.427001  0.425731
http                   6853   40550  0.562461  0.562353
area                    386    1441  0.670788  0.667841
nbsp                   3025    8758  0.724311  0.723857
cgi-bin                 567    1473  0.745415  0.742829
arial                  2907    3093  0.877287  0.876547
helvetica              2017    1751  0.897562  0.896439
sans-serif             1795     966  0.933925  0.932543
blank                  1574     597  0.952505  0.950858
t.pl                     76       0  1.000000  0.963589
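
The "Gra prob" column can be reproduced from the two count columns, which makes it easier to see where a word's rating comes from. The sketch below assumes the probability has the form spam / (spam + k * good), where k is a scale factor that is not given in this post; here it is inferred from the row for "the" and then checked against other rows. It presumably reflects the relative sizes of the spam and good word lists, but that is a guess. The "Rob prob" column is not modelled.

```python
# (spam count, good count, listed Graham probability), copied from the table above.
ROWS = {
    "the": (4333, 78312, 0.296205),
    "for": (7025, 72200, 0.425323),
    "products": (275, 2807, 0.427001),
    "http": (6853, 40550, 0.562461),
    "blank": (1574, 597, 0.952505),
    "t.pl": (76, 0, 1.000000),
}

def infer_scale(spam, good, prob):
    """Solve prob = spam / (spam + k * good) for k."""
    return spam * (1.0 / prob - 1.0) / good

def word_prob(spam, good, k):
    """Assumed per-word probability: spam / (spam + k * good)."""
    return spam / (spam + k * good)

k = infer_scale(*ROWS["the"])  # inferred from one row, not a documented value
for word, (spam, good, listed) in ROWS.items():
    model = word_prob(spam, good, k)
    print(f"{word:10s} listed={listed:.6f} model={model:.6f}")
```

With k taken from "the", the other rows agree with the listed values to about four decimal places. A word such as t.pl, which has no good count at all, comes out at exactly 1.0.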




