randomtrain vs bogotrain.pl
David Relson
relson at osagesoftware.com
Sat Jul 5 01:18:43 CEST 2003
At 12:08 PM 7/4/03, Boris 'pi' Piwinger wrote:
>David Relson <relson at osagesoftware.com> wrote:
>
> >> >> >                training        scoring
> >> >> >                spam  good      spam         good
> >> >> > bogofilter       75   585      101 (100%)   616 (100%)
> >> >> > randomtrain      24    27      101 (100%)   616 (100%)
> >> >> > bogotrain.pl      7     4       94 ( 93%)   605 ( 98%)
>
> >The scoring test had 101 messages originally classified spam and 616
> >originally classified ham. None of them were Nigerian. The hoped-for
> >result was to have all 717 messages classified as non-Nigerian. The two
> >larger wordlists got this 100% right,
>
>I don't understand: Above it says, 616 were recognized as
>non-Nigerian (good) and 101 as Nigerian (spam). So there are
>101 false classifications.
One last time...
Normally I use bogofilter to separate spam and ham. Nigerian scams are one
group of messages that I classify as spam. As I wanted to know if
bogofilter could be used to detect Nigerian scams, I scanned my spam
folders and found 75 Nigerian scam messages. Then I trained bogofilter
with the 75 Nigerian spam messages and 585 ham messages (ham in the
non-UCE sense). After that
training, I scored 717 messages. Of the 717 messages, 101 came from my
spam corpus (in the UCE sense) and 616 came from my non-spam corpus. None
of the 717 messages were Nigerian spam.
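For the curious, here is roughly what that train-then-score procedure
looks like as shell commands. This is a sketch, not the exact commands
used; the separate wordlist directory (~/.bogofilter-nigerian) and the
one-message-per-file paths are made up for illustration.

    # Keep the Nigerian wordlist separate from the everyday spam/ham
    # wordlist (directory name is hypothetical).
    mkdir -p ~/.bogofilter-nigerian

    # Train: register the Nigerian scams as "spam", the ham as "ham".
    for msg in nigerian/*; do
        bogofilter -d ~/.bogofilter-nigerian -s < "$msg"
    done
    for msg in ham/*; do
        bogofilter -d ~/.bogofilter-nigerian -n < "$msg"
    done

    # Score: bogofilter exits 0 when it calls a message spam
    # (here: Nigerian) and 1 when it calls it ham.
    for msg in test/*; do
        bogofilter -d ~/.bogofilter-nigerian < "$msg" \
            && echo "scored Nigerian: $msg"
    done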
I was curious to see how well the three different wordlists would do in
recognizing these 717 messages (none of which were Nigerian spam). The
first two wordlists got perfect scores: all the messages were correctly
scored as _not_ Nigerian spam. With the third wordlist, 699 messages were
scored correctly and 18 were scored incorrectly.
Conclusion: the third wordlist was too small to provide good results.
Unanswered question: Why do the two train-on-error scripts give such
different results?
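For reference, the train-on-error idea that both scripts implement boils
down to a loop like this (a sketch, not the scripts' actual code):

    # Score each message first; register it only when bogofilter
    # gets it wrong.
    for msg in spam/*; do
        if ! bogofilter < "$msg"; then
            bogofilter -s < "$msg"      # missed spam, register it
        fi
    done
    for msg in ham/*; do
        if bogofilter < "$msg"; then
            bogofilter -n < "$msg"      # false positive, register it
        fi
    done

Because each registration changes the wordlist that scores the next
message, the order in which messages are presented changes which ones
count as errors. randomtrain shuffles that order, so it is at least
plausible for two train-on-error scripts to end up with very differently
sized wordlists.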