randomtrain vs bogotrain.pl

Fri Jul 4 16:30:13 CEST 2003

At 09:36 AM 7/4/03, Boris 'pi' Piwinger wrote:
>David Relson wrote:
>
> > Today I've run another test set with 75 Nigerian spam and 585 non-spam.  As
> > test 1, I put _all_ the messages in spamlist and goodlist.  Test 2 used
> > randomtrain.  Test 3 used bogotrain.pl.  Each of the scripts was run 4
> > times and "extinction" was achieved.
>
>Using -f you only need to start my script once.
>
> > After training, I scored 101 spam and
> > 616 non-spam.
> >
> >                    training            scoring
> >                  spam  good       spam        good
> > bogofilter       75   585      101 (100%)  616 (100%)
> > randomtrain      24    27      101 (100%)  616 (100%)
> > bogotrain.pl      7     4       94 ( 93%)  605 ( 98%)
> >
> > While these results are in no way definitive, they indicate that the very
> > small wordlists produced by bogotrain.pl are inadequate.
>
>That might be a consequence of your very small and special
>training set:

As I said, the messages used in scoring were different from the messages used
in training.  With all training messages going into the wordlists (the
"bogofilter" experiment), the scoring was 100% correct.  With randomtrain's
51 messages in the wordlists (the "randomtrain" experiment), the scoring
was 100% correct.  With bogotrain.pl's 11 messages, the scoring was still
good (93% and 98%) but not perfect.
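
The percentages follow directly from the counts in the table.  A quick check
in Python, where the dictionary layout and the function name are only my
illustration and not part of either training script:

```python
# Counts copied from the results table: (correctly scored, total) per origin.
results = {
    "bogofilter":   {"spam": (101, 101), "good": (616, 616)},
    "randomtrain":  {"spam": (101, 101), "good": (616, 616)},
    "bogotrain.pl": {"spam": (94, 101),  "good": (605, 616)},
}


def percent_correct(correct, total):
    """Return the share of correctly scored messages as a whole percent."""
    return round(100.0 * correct / total)


# One whole-number percentage per experiment and per message origin.
summary = {
    name: {origin: percent_correct(*counts) for origin, counts in row.items()}
    for name, row in results.items()
}

for name, row in summary.items():
    print(f"{name:<13} spam {row['spam']:3d}%  good {row['good']:3d}%")
```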

> > 1) The spam used in the scoring are the usual UCE spam, not nigerian
> > spam.  They should all be scored as ham.
>
>Actually, most were scored as spam above. So all fail badly;-)

Sorry, I didn't explain well.  The "spam" and "good" columns under scoring 
indicate the origins of the messages.  The scores are how many were 
correctly scored.  Since there was no Nigerian spam in the messages in the 
"spam" and "good" columns, all messages should have scored as ham.  The 
very small wordlists generated by bogotrain.pl did not do as well in this 
test as did the other (larger) wordlists.

Conclusion: while bogotrain.pl creates very small wordlists that distinguish
Nigerian spam from normal ham, those small wordlists didn't have enough
information for classifying other messages.  The larger wordlists from the
other two tests did better.

> > 2) I expected to see some incorrect classifications in all the tests and am
> > surprised that 2 tests didn't have any.
>
>That indeed looks like an error.

Having no incorrect classifications is exceptionally good.  It's not an error.

>But most important I think is that your training set is way
>too small. I suggest the following experiment:

The three tests used identical data sets and gave different 
results.  That's all I was trying to show.

>Take all your training set (probably several thousand each).
>Create three databases as above. Run your real mail through
>it for some days. See where they disagree. The server load
>should not be too bad with three calls of bogofilter instead
>of one.

Sorry, but I don't have enough Nigerian spam to create such a large 
training set.
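
The comparison step itself would be easy to script, though.  A rough sketch in
Python, assuming each wordlist is wrapped in a function that returns "spam" or
"ham" for a message; find_disagreements and the three toy classifiers are
hypothetical stand-ins of mine and do not call bogofilter:

```python
def find_disagreements(messages, classifiers):
    """Return (message, verdicts) pairs where the classifiers do not all agree."""
    disagreements = []
    for msg in messages:
        # Ask every classifier for its verdict on this message.
        verdicts = {name: classify(msg) for name, classify in classifiers.items()}
        # More than one distinct verdict means at least two classifiers differ.
        if len(set(verdicts.values())) > 1:
            disagreements.append((msg, verdicts))
    return disagreements


# Toy stand-ins for the three wordlists; real ones would score with bogofilter.
classifiers = {
    "all messages": lambda text: "spam" if "transfer" in text or "prize" in text else "ham",
    "randomtrain":  lambda text: "spam" if "transfer" in text or "prize" in text else "ham",
    "bogotrain.pl": lambda text: "spam" if "transfer" in text else "ham",
}

messages = [
    "urgent transfer of funds from Lagos",
    "you have won a prize",
    "minutes of Tuesday's meeting",
]

for msg, verdicts in find_disagreements(messages, classifiers):
    print(msg, verdicts)
```

Only the messages where the databases differ need a look by hand, which keeps
the review effort small even with a lot of real mail.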

>pi
>
>PS: I'm about to leave for a week.

Have a good week!  I'll be gone as well - the Relson family heads out of 
town tomorrow for 7 days in a cabin on Paradise Lake, which is at the 
northern tip of Michigan's lower peninsula.  We'll talk more when I get back.

David