randomtrain vs bogotrain.pl

David Relson relson at osagesoftware.com
Fri Jul 4 17:12:36 CEST 2003


At 10:46 AM 7/4/03, Boris 'pi' Piwinger wrote:
>David Relson wrote:
>
> >> >                    training            scoring
> >> >                  spam  good       spam        good
> >> > bogofilter       75   585      101 (100%)  616 (100%)
> >> > randomtrain      24    27      101 (100%)  616 (100%)
> >> > bogotrain.pl      7     4       94 ( 93%)  605 ( 98%)
> >> >
> >> > While these results are in no way definitive, they indicate that the
> >> > very small wordlists produced by bogotrain.pl are inadequate.
> >>
> >>That might be a consequence of your very small and special
> >>training set:
> >
> > As said, the messages used in scoring were different from the messages
> > used in training.  With all training messages going into the wordlists
> > (the "bogofilter" experiment), the scoring was 100% correct.  With
> > randomtrain's 51 messages in the wordlists (the "randomtrain"
> > experiment), the scoring was 100% correct.  With bogotrain.pl's 11
> > messages, the scoring was still good (94% and 98%) but not perfect.
>
>Yes, but as you explain below, you would not expect that
>those spam messages are detected, you wanted to recognize
>Nigerian spam mails and that does not work for any of those
>tests.

The scoring test had 101 messages originally classified spam and 616 
originally classified ham.  None of them were Nigerian.  The hoped-for
result was to have all 717 messages classified as non-Nigerian.  The two 
larger wordlists got this 100% right, while the smallest wordlist got 18 
messages wrong.
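[Editor's note: the percentages in the table follow directly from the counts; a quick sanity check, assuming simple rounding to whole percents:]

```python
# Verify the bogotrain.pl row of the table above:
# 94 of 101 spam-origin and 605 of 616 ham-origin messages scored correctly.
spam_correct, spam_total = 94, 101
ham_correct, ham_total = 605, 616

spam_pct = round(100 * spam_correct / spam_total)   # -> 93
ham_pct = round(100 * ham_correct / ham_total)      # -> 98

# Misclassified messages across both sets.
wrong = (spam_total - spam_correct) + (ham_total - ham_correct)  # -> 18

print(spam_pct, ham_pct, wrong)
```

That accounts for the 18 misclassified messages: 7 from the spam-origin set and 11 from the ham-origin set.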


> >> > 1) The spam used in the scoring is the usual UCE spam, not Nigerian
> >> > spam.  They should all be scored as ham.
> >>
> >>Actually, most were scored as spam above. So all fail badly;-)
> >
> > Sorry, I didn't explain well.  The "spam" and "good" columns under scoring
> > indicate the origins of the messages.  The scores are how many were
> > correctly scored.  Since there was no Nigerian spam in the messages in the
> > "spam" and "good" columns, all messages should have scored as ham.
>
>Yes, I understood that. So all tests failed badly.

NOT SO.  (See above)


> > The
> > very small wordlists generated by bogotrain.pl did not do as well in this
> > test as did the other (larger) wordlists.
>
>What does "well" mean here? Taking your cats and dogs example:
>you only show German shepherds and want to identify a
>chihuahua. Now what is correct here? To say that the chihuahua
>is a German shepherd? Looks like GIGO.

I make no claims for the validity of the test.  It's definitely mixing 
apples and oranges.  However, there _does_ seem to be an accuracy effect 
and it _does_ seem to be related to wordlist size.

> > Conclusion, while bogotrain.pl creates very small wordlists to distinguish
> > Nigerian spam from normal ham, the small wordlists didn't have enough
> > information for classifying other messages.
>
>Yes, but this is no surprise. The approach is to get an
>exact understanding of the training set. If you hide
>relevant information, this must fail. Elementary statistics:
>go to a garage where the entry is only two meters high. You
>will be surprised how many trucks you find there. Not all
>that many, I guess;-) Looking at those statistics won't tell
>you anything about trucks on a road. IOW: The sampling is wrong.

I agree that needed information is lacking in test #3's wordlists.  Given
the perfect results of tests 1 & 2, the larger wordlists don't seem to
lack that information.
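[Editor's note: one plausible mechanism for the size effect is coverage -
tokens absent from a wordlist contribute no evidence at all.  The toy
scorer below is a deliberate simplification, not bogofilter's actual
Robinson-style computation; it only illustrates how a tiny wordlist
leaves most of a fresh message unscored:]

```python
# Toy illustration (NOT bogofilter's real math): tokens missing from the
# training wordlists contribute no evidence, so a tiny wordlist scores
# a new message on almost nothing and falls back toward neutral.
def spamicity(token, spam_counts, ham_counts):
    s, h = spam_counts.get(token, 0), ham_counts.get(token, 0)
    if s + h == 0:
        return None  # unknown token: no evidence either way
    return s / (s + h)

def score(message, spam_counts, ham_counts):
    probs = [p for t in message.split()
             if (p := spamicity(t, spam_counts, ham_counts)) is not None]
    if not probs:
        return 0.5  # nothing known: stay neutral
    return sum(probs) / len(probs)  # crude average, for illustration only

# A "large" wordlist knows many tokens; a tiny one knows only a few.
large_spam = {"viagra": 9, "offer": 7, "free": 8}
large_ham  = {"meeting": 6, "lunch": 5, "free": 2}
tiny_spam  = {"viagra": 1}
tiny_ham   = {"meeting": 1}

msg = "special offer free lunch today"
print(score(msg, large_spam, large_ham))  # three tokens known, ~0.6
print(score(msg, tiny_spam, tiny_ham))    # no token known: neutral 0.5
```

With the large wordlists three of the five tokens carry evidence; with the
tiny ones none do, so the score is pure noise.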

> > The larger wordlists from the other two tests did better.
>
>Would be interesting to see why.

Yes, it would.

> >> > 2) I expected to see some incorrect classifications in all the tests
> >> and am
>
>Do newer versions of Eudora still break lines like this?

I don't know.  I assume it's normal line wrap at column 72 (or whatever).

> >> > surprised that 2 tests didn't have any.
> >>
> >>That indeed looks like an error.
> >
> > No incorrect classifications is exceptionally good.  It's not an error.
>
>Depending on the property you want to identify.
>
> >>But most important I think is that your training set is way
> >>too small. I suggest the following experiment:
> >
> > The three tests used identical data sets and gave different
> > results.  That's all I was trying to show.
>
>Sure, but so what?
>
> >>Take all your training set (probably several thousand each).
> >>Create three databases as above. Run your real mail through
> >>them for some days. See where they disagree. The server load
> >>should not be too bad with three calls of bogofilter instead
> >>of one.
> >
> > Sorry, but I don't have enough Nigerian spam to create such a large
> > training set.
>
>I don't mean Nigerian spam. I mean just all spam and ham you
>have. That would make a reasonable sample to learn from.
>
> >>PS: I'm about to leave for a week.
> >
> > Have a good week!  I'll be gone as well - the Relson family heads out
> > of town tomorrow for 7 days in a cabin on Paradise Lake, which is at
> > the northern tip of Michigan's lower peninsula.  We'll talk more when
> > I get back.
>
>Sounds more like fun than my conference;-)

I'm sure you'll learn more.  We'll be seeing several families that we only 
see there.  It's great seeing them all grow.


More information about the Bogofilter mailing list