randomtrain vs bogotrain.pl

Fri Jul 4 15:36:29 CEST 2003

David Relson wrote:

> Today I've run another test set with 75 Nigerian spam and 585 non-spam.  As 
> test 1, I put _all_ the messages in spamlist and goodlist.  Test 2 used 
> randomtrain.  Test 3 used bogotrain.pl.  Each of the scripts was run 4 
> times and "extinction" was achieved. 

Using -f you only need to start my script once.

> After training, I scored 101 spam and 
> 616 non-spam.
> 
>                    training            scoring
>                  spam  good       spam        good
> bogofilter       75   585      101 (100%)  616 (100%)
> randomtrain      24    27      101 (100%)  616 (100%)
> bogotrain.pl      7     4       94 ( 93%)  605 ( 98%)
> 
> While these results are in no way definitive, they indicate that the very 
> small wordlists produced by bogotrain.pl are inadequate.

That might be a consequence of your very small and specialized
training set:

> 1) The spam used in the scoring are the usual UCE spam, not nigerian 
> spam.  They should all be scored as ham.

Actually, most were scored as spam above. So all fail badly;-)

> 2) I expected to see some incorrect classifications in all the tests and am 
> surprised that 2 tests didn't have any.

That indeed looks like an error.

But most important, I think, is that your training set is way
too small. I suggest the following experiment:

Take your entire training set (probably several thousand each).
Create three databases as above. Run your real mail through
them for a few days and see where they disagree. The server load
should not be too bad with three calls of bogofilter instead
of one.
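
The comparison step could be sketched roughly as follows. This is only
an illustration, not part of either training script: the function name
find_disagreements and the idea of passing one classifier callable per
database are assumptions. How each callable gets its verdict (for
example by running bogofilter against one of the three wordlists) is
left open here.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A classifier takes the raw message text and returns a verdict string,
# e.g. "spam" or "ham". Each callable stands for one trained database.
Classifier = Callable[[str], str]


def find_disagreements(
    classifiers: Dict[str, Classifier],
    messages: Iterable[Tuple[str, str]],
) -> List[Tuple[str, Dict[str, str]]]:
    """Return (message id, verdicts) for every message on which
    the classifiers do not all give the same verdict."""
    disagreements = []
    for msg_id, text in messages:
        # Ask every database for its verdict on this message.
        verdicts = {name: classify(text) for name, classify in classifiers.items()}
        # More than one distinct verdict means the databases disagree.
        if len(set(verdicts.values())) > 1:
            disagreements.append((msg_id, verdicts))
    return disagreements
```

Only the messages in the returned list need a manual look; everything
the three databases agree on can be skipped.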

pi

PS: I'm about to leave for a week.