sample Robinson histograms

David Relson relson at osagesoftware.com
Mon Oct 28 20:41:18 CET 2002


Greetings,

As you all know, over the past week or so, Greg Louis' code for the 
Robinson algorithm has been added to bogofilter.  It's an alternate method 
for computing spamicity that shows great promise for improving bogofilter's 
accuracy at identifying spam.

Wearing my hat as a user, I like my delivered spam to include some 
information about _why_ it was classified as spam.  To that end, my 
procmail recipe uses flags "-p" and "-v".  With the Graham algorithm, this 
lists the 15 extrema words, their spam probabilities, and the calculated 
spamicity result.  Doing the same thing for Robinson is a bit awkward since 
Robinson calculates spamicity using all words of the message (not just the 
15 "most interesting").  Listing all words with probabilities, etc., is 
just not feasible.

Using the R statistical language, Greg is able to generate a bar chart 
(histogram) showing word probabilities for a message.  I think that is a 
really neat thing to do.  Thinking about that, over the weekend I 
implemented a simple histogram capability for bogofilter.

Here are samples from a couple of spam messages received today:


SUBJECT: Get the Computer Skills you need from Video Professor + 1000 miles
X-Bogosity: Yes, tests=bogofilter, spamicity=0.639510, version=0.7.6-1028.1015

#   int  cnt    prob   spamicity  histogram
#  0.00    7  0.066754  0.025292  ######
#  0.10   22  0.161080  0.073685  #################
#  0.20   29  0.264727  0.133399  ######################
#  0.30   47  0.343250  0.209310  ####################################
#  0.40   39  0.445507  0.272259  ##############################
#  0.50   33  0.547515  0.328232  #########################
#  0.60   25  0.642234  0.372655  ###################
#  0.70   12  0.755678  0.398555  #########
#  0.80   35  0.857297  0.477548  ###########################
#  0.90   67  0.985145  0.639510  ##################################################


Subject: How Bad is Your Credit?  Check for Free!
X-Bogosity: Yes, tests=bogofilter, spamicity=0.687086, version=0.7.6-1028.1254

#   int  cnt    prob   spamicity  histogram
#  0.00    1  0.097515  0.042730  #
#  0.10    1  0.183198  0.071822  #
#  0.20    7  0.232483  0.139328  #######
#  0.30   10  0.345227  0.221346  ##########
#  0.40    9  0.439136  0.286561  #########
#  0.50   10  0.550528  0.361301  ##########
#  0.60   11  0.651327  0.435680  ###########
#  0.70    2  0.772034  0.451832  ##
#  0.80   17  0.845352  0.563532  #################
#  0.90   20  0.976594  0.687086  ####################

The histogram is added to email by using flags "-r", "-p", and "-v".  If 
you're just testing, and not processing messages for delivery, the same 
effect can be achieved using "-r -v -v".
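
For anyone curious about the layout, here is a rough Python sketch of how a 
table like the ones above could be produced from a list of per-word spam 
probabilities.  It is not bogofilter's code.  The names (histogram_rows, 
format_histogram, MAX_BAR) are made up for illustration, and the rule of 
scaling bars to at most 50 characters is an assumption read off the first 
sample.  The spamicity column is left out, since how it is computed per 
interval is not described here.

```python
BUCKETS = 10   # ten intervals starting at 0.00, 0.10, ..., 0.90
MAX_BAR = 50   # assumed maximum bar width, in characters


def histogram_rows(probs):
    """Group word probabilities into ten intervals and summarize each one.

    Returns one (interval_start, count, mean_probability, bar) tuple per interval.
    """
    counts = [0] * BUCKETS
    sums = [0.0] * BUCKETS
    for p in probs:
        # A probability of exactly 1.0 is placed in the last interval.
        i = min(int(p * BUCKETS), BUCKETS - 1)
        counts[i] += 1
        sums[i] += p

    peak = max(counts)
    rows = []
    for i in range(BUCKETS):
        mean = sums[i] / counts[i] if counts[i] else 0.0
        if peak > MAX_BAR:
            # Scale so the fullest interval gets MAX_BAR characters,
            # rounding up with integer arithmetic.
            width = (counts[i] * MAX_BAR + peak - 1) // peak
        else:
            # Small counts are drawn one character per word.
            width = counts[i]
        rows.append((i / BUCKETS, counts[i], mean, "#" * width))
    return rows


def format_histogram(probs):
    """Render the rows as '#'-prefixed text lines, similar to the samples."""
    lines = ["#   int  cnt    prob   histogram"]
    for start, cnt, mean, bar in histogram_rows(probs):
        lines.append(f"#  {start:4.2f} {cnt:4d}  {mean:8.6f}  {bar}")
    return "\n".join(lines)
```

Calling format_histogram() on the list of word probabilities for one message 
returns the text block, which could then be written into the message headers 
or printed for inspection.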

Enjoy!

David




