Robinson vs Graham - a testing methodology

David Relson relson at osagesoftware.com
Fri Oct 25 02:46:15 CEST 2002


Greetings,

As many of you know, I've been working with Greg Louis to merge his 
implementation of the Robinson algorithm into bogofilter. Some weeks ago, I 
took an early version of his code and merged it into my private, test 
version of bogofilter.  Back then, I ran some Graham vs. Robinson 
comparison tests.  For my test, I had a set of 32 spam and 10 good messages 
that bogofilter hadn't previously encountered.  The Robinson algorithm 
recognized about 50% more of the spam than did the Graham algorithm.

That was then.  This is now.

Now we have the Robinson algorithm included in the current (CVS) source 
tree for bogofilter.  The big questions to answer are whether or not we 
should convert bogofilter from the Graham algorithm to the Robinson 
algorithm, and why (or why not).

To answer these questions, information needs to be gathered, performance 
measured, and statistics compiled.

To further this work, I have an additional 17 days of email that have come 
in since I started using bogofilter in production.  This gives me several 
thousand additional messages, including several hundred spam 
messages.  I've been thinking about how to do some meaningful testing using 
this data.  Here are my thoughts, ideas, and plans...

First, I'll use bogofilter word lists that predate these new 
messages.  This prevents bogofilter from seeing any message it has already 
been told is spam or non-spam.

Second, the testing will have two major components.  One is counting and 
the other is learning.

*** Counting ***

Each message is classified twice by bogofilter - once for each 
algorithm.  The message is categorized as:

	G  - only Graham indicates spam
	R  - only Robinson indicates spam
	GR - both algorithms indicate spam
	NN - both algorithms indicate non-spam

and a count is kept of the messages in each group.  This indicates how well 
the algorithms do with the starting word lists.
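
To make the counting concrete, here is a rough Python sketch of the tally 
loop.  The per-algorithm binary names, the mbox file name, and the mailbox 
handling are all assumptions for illustration; the only thing taken from 
bogofilter itself is its convention of exiting 0 for spam and 1 for non-spam.

    import mailbox
    import subprocess
    from collections import Counter

    # Hypothetical binaries, one built per algorithm; the real test may
    # select Graham vs. Robinson some other way.
    GRAHAM_CMD   = ["./bogofilter-graham"]
    ROBINSON_CMD = ["./bogofilter-robinson"]

    def is_spam(cmd, raw_message):
        # bogofilter reads a message on stdin and exits 0 for spam,
        # 1 for non-spam.
        return subprocess.run(cmd, input=raw_message).returncode == 0

    counts = Counter()
    for msg in mailbox.mbox("new-mail.mbox"):   # assumed mbox of new mail
        raw = msg.as_bytes()
        g = is_spam(GRAHAM_CMD, raw)
        r = is_spam(ROBINSON_CMD, raw)
        key = "GR" if (g and r) else "G" if g else "R" if r else "NN"
        counts[key] += 1

    print(dict(counts))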

*** Learning ***

Each message is again classified twice by bogofilter.  The message again 
goes into the G, R, GR, or NN bin for counting.  However, immediately after 
classification, each message classified as spam (by either or both 
algorithms) is fed into the spam list and messages classified NN go into 
the non-spam list.  All messages are processed in this manner - classify 
twice, then update word list.  Again, the final counts of G, R, ... are 
tallied (and saved).  Any changes in the tallies reflect what bogofilter 
has learned during this phase.
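
Again for illustration, a rough sketch of that classify-then-train loop, 
reusing the helpers above.  It assumes both algorithms read the same word 
lists, so a single registration call with bogofilter's -s (register as spam) 
or -n (register as non-spam) option updates them for both; that sharing is 
an assumption about the test setup.

    def register(flag, raw_message):
        # -s adds the message's tokens to the spam list, -n to the
        # non-spam list.
        subprocess.run(ROBINSON_CMD + [flag], input=raw_message)

    counts = Counter()
    for msg in mailbox.mbox("new-mail.mbox"):
        raw = msg.as_bytes()
        g = is_spam(GRAHAM_CMD, raw)
        r = is_spam(ROBINSON_CMD, raw)
        key = "GR" if (g and r) else "G" if g else "R" if r else "NN"
        counts[key] += 1
        # update the word lists immediately after classifying
        register("-s" if (g or r) else "-n", raw)

    print(dict(counts))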

The counting and learning phases are repeated several times, using the 
updated wordlists each time.  Message counts are saved.  The goal of this 
repetition is to measure the learning effect and to see how quickly the 
counts converge on a final result.
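
A possible driver for that repetition, assuming the loop above is wrapped in 
a function learning_pass() that returns its Counter.  Stopping when two 
consecutive passes produce identical tallies is just one reading of 
"converge on a final result", not a fixed criterion.

    def run_passes(max_passes=5):
        history = []
        for i in range(max_passes):
            counts = learning_pass()   # hypothetical wrapper for the loop above
            history.append(counts)
            print("pass %d: %s" % (i + 1, dict(counts)))
            # stop once the tallies no longer change between passes
            if len(history) >= 2 and history[-1] == history[-2]:
                break
        return history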

*** Results ***

I don't have them yet.  I'm still putting together a test script and 
testing it.  I'll report when I have results.

David
--------------------------------------------------------
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800




