randomtrain vs bogotrain.pl

David Relson relson at osagesoftware.com
Fri Jul 4 15:22:34 CEST 2003


Greetings,

Being curious, yesterday I did some experimentation with randomtrain and 
pi's bogotrain.pl.

I had approx 55 Nigerian spam and 80 non-spam.  Unfortunately, I didn't 
save the results, but I do remember that bogotrain.pl found 5 spam and 3 
non-spam to be adequate for complete training while randomtrain's numbers 
were about 10 higher (15 and 13).  The randomtrain numbers changed slightly 
in different runs, but were always significantly higher than bogotrain.pl's 
numbers.  I attributed this to an "ordering effect" and didn't investigate 
further.  Anyhow, after training with either script, all the Nigerian spam 
messages were successfully scored as spam.

Today I've run another test set with 75 Nigerian spam and 585 non-spam.  As 
test 1, I put _all_ the messages in spamlist and goodlist.  Test 2 used 
randomtrain.  Test 3 used bogotrain.pl.  Each of the scripts was run 4 
times and "extinction" was achieved.  After training, I scored 101 spam and 
616 non-spam.
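For readers unfamiliar with the approach, here is a minimal sketch of a generic train-on-error loop that repeats until a full pass adds nothing, which is roughly what "extinction" refers to above. It assumes both scripts follow this general pattern and is not the actual logic of randomtrain or bogotrain.pl. `ToyFilter`, `train_on_error`, and the sample messages are hypothetical names and data invented for illustration, and the token-count scoring is far cruder than bogofilter's.

```python
import random
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter (NOT bogofilter's algorithm)."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()

    def train(self, text, is_spam):
        # Register each distinct token once per message.
        target = self.spam if is_spam else self.good
        target.update(set(text.lower().split()))

    def is_spam(self, text):
        tokens = set(text.lower().split())
        spam_hits = sum(self.spam[t] for t in tokens)
        good_hits = sum(self.good[t] for t in tokens)
        # Ties (including an empty wordlist) are treated as non-spam.
        return spam_hits > good_hits


def train_on_error(filt, messages, rng, max_passes=20):
    """Train only on misclassified messages, in random order.

    messages is a list of (text, is_spam) pairs. Returns the number of
    messages trained in each pass; a final 0 means a full pass added nothing.
    """
    trained_per_pass = []
    for _ in range(max_passes):
        order = list(messages)
        rng.shuffle(order)  # a random order, in the spirit of "randomtrain"
        trained = 0
        for text, label in order:
            if filt.is_spam(text) != label:
                filt.train(text, label)
                trained += 1
        trained_per_pass.append(trained)
        if trained == 0:
            break
    return trained_per_pass


# Made-up sample messages (illustrative only).
messages = [
    ("urgent funds transfer", True),
    ("funds for next of kin", True),
    ("confidential funds deposit", True),
    ("cheap pills online", True),
    ("pills discount today", True),
    ("meeting agenda attached", False),
    ("lunch on friday", False),
    ("patch review attached", False),
]

toy = ToyFilter()
passes = train_on_error(toy, messages, random.Random(0))
print(passes)
```

In this toy data only one message from each group of similar spam ends up in the wordlist, which shows how train-on-error can stop with a very small training set.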

                  training            scoring
                spam   good      spam         good
bogofilter        75    585    101 (100%)   616 (100%)
randomtrain       24     27    101 (100%)   616 (100%)
bogotrain.pl       7      4     94 ( 93%)   605 ( 98%)

While these results are in no way definitive, they indicate that the very 
small wordlists produced by bogotrain.pl are inadequate.
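As a quick check of the percentages in the table, the arithmetic can be reproduced as follows. This assumes each scoring count is the number of messages scored as expected out of the 101 spam and 616 non-spam; the helper name `pct` is hypothetical.

```python
def pct(count, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)


SPAM_TOTAL = 101
GOOD_TOTAL = 616

# Scoring counts taken from the table: (spam, good).
scoring = {
    "bogofilter": (101, 616),
    "randomtrain": (101, 616),
    "bogotrain.pl": (94, 605),
}

for name, (spam_ok, good_ok) in scoring.items():
    print(f"{name:13s} spam {pct(spam_ok, SPAM_TOTAL):3d}%  "
          f"good {pct(good_ok, GOOD_TOTAL):3d}%")
```

Under that reading, the bogotrain.pl row corresponds to 7 spam and 11 non-spam messages not scored as expected.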

Notes:

1) The spam used in the scoring is the usual UCE spam, not Nigerian 
spam.  They should all be scored as ham.
2) I expected to see some incorrect classifications in all the tests and am 
surprised that two of the tests didn't have any.

David