randomtrain vs bogotrain.pl
David Relson
relson at osagesoftware.com
Fri Jul 4 15:22:34 CEST 2003
Greetings,
Being curious, yesterday I did some experimentation with randomtrain and
pi's bogotrain.pl.
I had approximately 55 Nigerian spam and 80 non-spam messages. Unfortunately,
I didn't save the results, but I do remember that bogotrain.pl found 5 spam
and 3 non-spam to be adequate for complete training, while randomtrain's
numbers were about 10 higher (15 and 13). The randomtrain numbers changed
slightly between runs, but were always significantly higher than
bogotrain.pl's. I attributed this to an "ordering effect" and didn't
investigate further. In any case, after training with either script, all the
Nigerian spam messages were successfully scored as spam.
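A rough illustration of why so few messages can be "adequate": assuming both
scripts work on a train-on-error basis (a message is added to the wordlists
only when the filter currently misclassifies it), the idea can be sketched in
Python as below. ToyFilter, train_on_error, and the sample messages are
hypothetical stand-ins invented for this sketch; they are not bogofilter's
scoring algorithm or either script's actual code.

```python
import random
from collections import Counter


def tokens(text):
    """Split a message into a set of lowercase words."""
    return set(text.lower().split())


class ToyFilter:
    """Hypothetical stand-in for a real filter: per-class token counts."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()

    def train(self, text, is_spam):
        (self.spam if is_spam else self.good).update(tokens(text))

    def is_spam(self, text):
        # Positive score means more spam evidence than good evidence.
        # An untrained filter scores 0, so it calls everything "good".
        score = sum(self.spam[t] - self.good[t] for t in tokens(text))
        return score > 0


def train_on_error(filt, messages, max_passes=10, seed=0):
    """Train only on misclassified messages until a pass finds no errors."""
    rng = random.Random(seed)
    trained = []
    for _ in range(max_passes):
        order = messages[:]
        rng.shuffle(order)  # random order, so results can vary by seed
        errors = 0
        for text, is_spam in order:
            if filt.is_spam(text) != is_spam:
                filt.train(text, is_spam)
                trained.append((text, is_spam))
                errors += 1
        if errors == 0:  # nothing left to learn from this message set
            break
    return trained


# Made-up sample data: (message text, is_spam)
messages = [
    ("urgent transfer of funds from lagos bank", True),
    ("confidential business proposal funds transfer", True),
    ("dear friend urgent confidential proposal", True),
    ("meeting agenda for the project review", False),
    ("patch for the build script attached", False),
    ("lunch on friday with the project team", False),
]

filt = ToyFilter()
trained = train_on_error(filt, messages)
print(len(trained), "of", len(messages), "messages used for training")
```

With this toy data only one spam message ends up in the training set, because
the first one trained already shares tokens with the other two. Which one is
picked depends on the shuffle, which is one way an ordering effect can arise.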
Today I've run another test set with 75 Nigerian spam and 585 non-spam
messages. As test 1, I put _all_ the messages in spamlist and goodlist (the
"bogofilter" row below). Test 2 used
randomtrain. Test 3 used bogotrain.pl. Each of the scripts was run 4
times and "extinction" was achieved. After training, I scored 101 spam and
616 non-spam.
                  training            scoring
                spam   good       spam          good
bogofilter        75    585    101 (100%)    616 (100%)
randomtrain       24     27    101 (100%)    616 (100%)
bogotrain.pl       7      4     94 ( 93%)    605 ( 98%)
While these results are in no way definitive, they suggest that the very
small wordlists produced by bogotrain.pl are inadequate: with them, 7 of the
101 spam and 11 of the 616 non-spam messages were misclassified.
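For anyone who wants to check the percentages, here is a small Python sketch
that recomputes them from the counts in the table. The dictionary layout and
the percent helper are just conveniences of this sketch, not output from
bogofilter or either training script.

```python
# Number of messages scored after training (from the text above).
SCORED_SPAM = 101
SCORED_GOOD = 616

# Correctly classified (spam, good) counts per test, from the table.
results = {
    "bogofilter": (101, 616),
    "randomtrain": (101, 616),
    "bogotrain.pl": (94, 605),
}


def percent(correct, total):
    """Share of correctly classified messages, rounded to a whole percent."""
    return round(100.0 * correct / total)


for name, (spam_ok, good_ok) in results.items():
    print(
        f"{name:<13}"
        f" spam {percent(spam_ok, SCORED_SPAM):3d}%"
        f" good {percent(good_ok, SCORED_GOOD):3d}%"
        f" errors: {SCORED_SPAM - spam_ok} spam, {SCORED_GOOD - good_ok} good"
    )
```

The bogotrain.pl row works out to 94/101 (about 93%) and 605/616 (about 98%),
matching the table.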
Notes:
1) The spam used in the scoring is the usual UCE spam, not Nigerian
spam. It should all be scored as spam.
2) I expected to see some incorrect classifications in all the tests and am
surprised that 2 tests didn't have any.
David