randomtrain vs bogotrain.pl

Fri Jul 4 16:46:26 CEST 2003

David Relson wrote:

>> >                    training            scoring
>> >                  spam  good       spam        good
>> > bogofilter       75   585      101 (100%)  616 (100%)
>> > randomtrain      24    27      101 (100%)  616 (100%)
>> > bogotrain.pl      7     4       94 ( 93%)  605 ( 98%)
>> >
>> > While these results are in no way definitive, they indicate that the very
>> > small wordlists produced by bogotrain.pl are inadequate.
>>
>>That might be a consequence of your very small and special
>>training set:
> 
> As said, the messages used in scoring were different from the messages used 
> in training.  With all training messages going into the wordlists (the 
> "bogofilter" experiment), the scoring was 100% correct.  With randomtrain's 
> 51 messages in the wordlists (the "randomtrain" experiment), the scoring 
> was 100% correct.  With bogotrain.pl's 11 messages, the scoring was still 
> good (93% and 98%) but not perfect.

Yes, but as you explain below, you would not expect those
spam messages to be detected. You wanted to recognize
Nigerian spam mails, and that does not work in any of those
tests.

>> > 1) The spam used in the scoring are the usual UCE spam, not nigerian
>> > spam.  They should all be scored as ham.
>>
>>Actually, most were scored as spam above. So all fail badly;-)
> 
> Sorry, I didn't explain well.  The "spam" and "good" columns under scoring 
> indicate the origins of the messages.  The scores are how many were 
> correctly scored.  Since there was no Nigerian spam in the messages in the 
> "spam" and "good" columns, all messages should have scored as ham.

Yes, I understood that. So all tests failed badly.

> The 
> very small wordlists generated by bogotrain.pl did not do as well in this 
> test as did the other (larger) wordlists.

What does "well" mean here? Taking your cats and dogs example:
you only show German shepherds and want to identify a
chihuahua. Now what is correct here? To say that a chihuahua
is a German shepherd? Looks like GIGO.

> Conclusion, while bogotrain.pl creates very small wordlists to distinguish 
> Nigerian spam from normal ham, the small wordlists didn't have enough 
> information for classifying other messages.

Yes, but this is no surprise. The approach is to get an
exact understanding of the training set. If you hide
relevant information, this must fail. Elementary statistics:
go to a garage where the entry is only two meters high. You
will be surprised how many trucks you find there. Not all
that many, I guess;-) Looking at those statistics won't tell
you anything about trucks on a road. IOW: The sampling is wrong.
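
The garage example can be sketched in a few lines. This is only an illustration: the vehicle heights and the names `ROAD`, `garage_sample` and `truck_share` are made up, and only the two-meter limit comes from the text above.

```python
# Hypothetical vehicles on the road: (kind, height in meters).
# The numbers are invented purely to illustrate the sampling problem.
ROAD = [
    ("car", 1.5), ("car", 1.4), ("car", 1.6), ("van", 1.9),
    ("truck", 3.8), ("truck", 4.0), ("truck", 3.5), ("car", 1.5),
]
GARAGE_LIMIT_M = 2.0  # entry height of the garage


def truck_share(vehicles):
    """Fraction of vehicles labeled 'truck'; 0.0 for an empty sample."""
    if not vehicles:
        return 0.0
    return sum(1 for kind, _ in vehicles if kind == "truck") / len(vehicles)


def garage_sample(vehicles, limit=GARAGE_LIMIT_M):
    """Keep only the vehicles that fit under the entry height."""
    return [(kind, height) for kind, height in vehicles if height < limit]


road_share = truck_share(ROAD)
garage_share = truck_share(garage_sample(ROAD))
print(f"trucks on the road:   {road_share:.0%}")
print(f"trucks in the garage: {garage_share:.0%}")
```

The sample taken inside the garage contains no trucks at all, although they make up a good part of the road traffic. A wordlist trained only on a filtered sample is in the same position.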

> The larger wordlists from the other two tests did better.

Would be interesting to see why.

>> > 2) I expected to see some incorrect classifications in all the tests 
>> and am

Do newer versions of Eudora still break lines like this?

>> > surprised that 2 tests didn't have any.
>>
>>That indeed looks like an error.
> 
> No incorrect classifications is exceptionally good.  It's not an error.

Depending on the property you want to identify.

>>But most important I think is that your training set is way
>>too small. I suggest the following experiment:
> 
> The three tests used identical data sets and gave different 
> results.  That's all I was trying to show.

Sure, but so what?

>>Take all your training set (probably several thousand each).
>>Create three databases as above. Run your real mail through
>>it for some days. See where they disagree. The server load
>>should not be too bad with three calls of bogofilter instead
>>of one.
> 
> Sorry, but I don't have enough Nigerian spam to create such a large 
> training set.

I don't mean Nigerian spam. I mean just all spam and ham you
have. That would make a reasonable sample to learn from.
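
The "see where they disagree" step could look roughly like the sketch below. It assumes each message has already been scored against all three wordlists and the verdicts collected into a dictionary. The message ids, the verdicts and the function names are hypothetical; nothing here calls bogofilter itself.

```python
from collections import Counter

# Hypothetical verdicts: message id -> {wordlist name -> "spam" or "ham"}.
# In a real run these would come from scoring each message three times.
VERDICTS = {
    "msg-001": {"bogofilter": "spam", "randomtrain": "spam", "bogotrain.pl": "spam"},
    "msg-002": {"bogofilter": "ham", "randomtrain": "ham", "bogotrain.pl": "spam"},
    "msg-003": {"bogofilter": "ham", "randomtrain": "ham", "bogotrain.pl": "ham"},
    "msg-004": {"bogofilter": "spam", "randomtrain": "ham", "bogotrain.pl": "ham"},
}


def disagreements(verdicts):
    """Return only the messages on which the wordlists do not all agree."""
    return {
        msg: by_list
        for msg, by_list in verdicts.items()
        if len(set(by_list.values())) > 1
    }


def dissent_counts(verdicts):
    """Count, per wordlist, how often it differs from the majority verdict.

    Assumes two labels and an odd number of wordlists, so that a
    majority always exists for a disputed message.
    """
    counts = Counter()
    for by_list in disagreements(verdicts).values():
        majority, _ = Counter(by_list.values()).most_common(1)[0]
        for name, label in by_list.items():
            if label != majority:
                counts[name] += 1
    return counts


for msg, by_list in disagreements(VERDICTS).items():
    print(msg, by_list)
print(dict(dissent_counts(VERDICTS)))
```

The disputed messages are the ones worth reading by hand, since they show which wordlist is missing information.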

>>PS: I'm about to leave for a week.
> 
> Have a good week!  I'll be gone as well - the Relson family heads out of 
> town tomorrow for 7 days in a cabin on Paradise Lake, which is at the 
> northern tip of Michigan's lower peninsula.  We'll talk more when I get back.

Sounds more like fun than my conference;-)

pi