Robinson vs Graham - a testing methodology

Greg Louis glouis at dynamicro.on.ca
Fri Oct 25 13:22:01 CEST 2002


On 20021024 (Thu) at 2046:15 -0400, David Relson wrote:

> *** Learning ***
> 
> Each message is again classified twice by bogofilter.  The message again 
> goes into the G, R, GR, or NN bin for counting.  However, immediately after 
> classification, each message classified as spam (by either or both 
> algorithms) is fed into the spam list and messages classified NN go into 
> the non-spam list.  All messages are processed in this manner - classify 
> twice, then update word list.  Again, the final counts of G, R, ... are 
> tallied (and saved).  Any changes in the tallies reflect what bogofilter 
> has learned while processing this learning phase.

I'd suggest doing an additional, separate "supervised learning" test,
in which the procedure is similar to the above but the destination list
(spam or non-spam) for each message is determined by a human.  It is
certainly valuable to test how well each algorithm copes with its own
errors, which is what the unsupervised test will show us.  However, I
believe it would also be valuable to test how well each calculation
method learns when fed "known good" data, the more so, perhaps, because
I suspect that neither method will learn well by itself if the initial
error rate is significant.
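
To make the difference between the two procedures concrete, here is a
rough sketch in Python.  It is not bogofilter code.  The names
learning_phase, classify_graham, classify_robinson and register are
hypothetical stand-ins for "run the classifier" and "update the word
list", and I'm assuming G and R mean "spam according to Graham only"
and "spam according to Robinson only", with GR meaning both and NN
neither.

```python
from collections import Counter


def bin_name(graham_spam, robinson_spam):
    """Map the two verdicts onto the bins G, R, GR, NN (assumed meanings)."""
    if graham_spam and robinson_spam:
        return "GR"
    if graham_spam:
        return "G"
    if robinson_spam:
        return "R"
    return "NN"


def learning_phase(messages, classify_graham, classify_robinson,
                   register, human_labels=None):
    """Classify each message twice, tally the bin, then update a word list.

    classify_graham / classify_robinson: hypothetical callables that
        return True when the message is judged to be spam.
    register(msg, is_spam): hypothetical callable that adds the message
        to the spam list (True) or the non-spam list (False).
    human_labels: None for the unsupervised test; otherwise a sequence
        of booleans, one per message, giving the human's decision.
    """
    tallies = Counter()
    for i, msg in enumerate(messages):
        g = classify_graham(msg)
        r = classify_robinson(msg)
        tallies[bin_name(g, r)] += 1
        if human_labels is None:
            # Unsupervised: spam if either algorithm says so, else non-spam.
            is_spam = g or r
        else:
            # Supervised: the human decides the destination list.
            is_spam = bool(human_labels[i])
        # Update the word list immediately, before the next message.
        register(msg, is_spam)
    return tallies
```

The only difference between the two tests is where is_spam comes from;
the classification and tallying are identical, so the final counts from
the two runs should be directly comparable.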

I can collect a few thousand mixed emails easily in a couple of days,
so I've set things up to do so.  I'll follow your method, to provide an
independent test with directly comparable results.  I hope I've
persuaded you that we should do both unsupervised and supervised
testing; I will run both myself, but it would be nice to have data from
your test corpus as well.  I hope that some other readers of this list
will also be in a position to contribute; the more results we get, the
better chance we have of a successful test (one that yields a clear
result one way or the other).

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



