randomtrain vs bogotrain.pl
David Relson
relson at osagesoftware.com
Sat Jul 5 01:18:43 CEST 2003
At 12:08 PM 7/4/03, Boris 'pi' Piwinger wrote:
>David Relson <relson at osagesoftware.com> wrote:
>
> >> >> >                training        scoring
> >> >> >                spam  good      spam         good
> >> >> > bogofilter       75   585      101 (100%)   616 (100%)
> >> >> > randomtrain      24    27      101 (100%)   616 (100%)
> >> >> > bogotrain.pl      7     4       94 ( 93%)   605 ( 98%)
>
> >The scoring test had 101 messages originally classified spam and 616
> >originally classified ham. None of them were Nigerian. The hoped-for
> >result was to have all 717 messages classified as non-Nigerian. The two
> >larger wordlists got this 100% right,
>
>I don't understand: Above it says, 616 were recognized as
>non-Nigerian (good) and 101 as Nigerian (spam). So there are
>101 false classifications.
One last time...
Normally I use bogofilter to separate spam and ham. Nigerian scams are one
group of messages that I classify as spam. As I wanted to know if
bogofilter could be used to detect Nigerian scams, I scanned my spam
folders and found 75 Nigerian scam messages. Then I trained bogofilter
with the 75 Nigerian spam messages and 585 ham messages (ham in the
non-UCE sense). After that
training, I scored 717 messages. Of the 717 messages, 101 came from my
spam corpus (in the UCE sense) and 616 came from my non-spam corpus. None
of the 717 messages were Nigerian spam.
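For the curious, here is roughly what that train-then-score procedure
looks like as shell commands. This is a sketch, not the exact commands
used; the separate wordlist directory (~/.bogofilter-nigerian) and the
one-message-per-file paths are made up for illustration.

    # Keep the Nigerian wordlist separate from the everyday spam/ham
    # wordlist (directory name is hypothetical).
    mkdir -p ~/.bogofilter-nigerian

    # Train: register the Nigerian scams as "spam", the ham as "ham".
    for msg in nigerian/*; do
        bogofilter -d ~/.bogofilter-nigerian -s < "$msg"
    done
    for msg in ham/*; do
        bogofilter -d ~/.bogofilter-nigerian -n < "$msg"
    done

    # Score: bogofilter exits 0 when it calls a message spam
    # (here: Nigerian) and 1 when it calls it ham.
    for msg in test/*; do
        bogofilter -d ~/.bogofilter-nigerian < "$msg" \
            && echo "scored Nigerian: $msg"
    done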
I was curious to see how well the three different wordlists would do in
recognizing these 717 messages (none of which were Nigerian spam). The
first two wordlists got perfect scores: all the messages were correctly
scored as _not_ Nigerian spam. With the third wordlist, 699 messages were
scored correctly and 18 were scored incorrectly.
Conclusion: the third wordlist was too small to provide good results.
Unanswered question: Why do the two train-on-error scripts give such
different results?
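For reference, the train-on-error idea that both scripts implement boils
down to a loop like this (a sketch, not the scripts' actual code):

    # Score each message first; register it only when bogofilter
    # gets it wrong.
    for msg in spam/*; do
        if ! bogofilter < "$msg"; then
            bogofilter -s < "$msg"      # missed spam, register it
        fi
    done
    for msg in ham/*; do
        if bogofilter < "$msg"; then
            bogofilter -n < "$msg"      # false positive, register it
        fi
    done

Because each registration changes the wordlist that scores the next
message, the order in which messages are presented changes which ones
count as errors. randomtrain shuffles that order, so it is at least
plausible for two train-on-error scripts to end up with very differently
sized wordlists.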