randomtrain vs bogotrain.pl

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Fri Jul 4 18:08:59 CEST 2003


David Relson <relson at osagesoftware.com> wrote:

>> >> >                    training            scoring
>> >> >                  spam  good       spam        good
>> >> > bogofilter       75   585      101 (100%)  616 (100%)
>> >> > randomtrain      24    27      101 (100%)  616 (100%)
>> >> > bogotrain.pl      7     4       94 ( 93%)  605 ( 98%)

>The scoring test had 101 messages originally classified spam and 616 
>originally classified ham.  None of them were Nigerian.  The hoped for 
>result was to have all 717 messages classified as non-Nigerian.  The two 
>larger wordlists got this 100% right, 

I don't understand: the table above says that 616 were
recognized as non-Nigerian (good) and 101 as Nigerian (spam).
So there are 101 false classifications.
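
To make the two readings concrete, here is a small sketch in
Python. It is only an illustration: the counts come from the
scoring columns of the table quoted above, the variable names
are made up, and the second reading assumes that the 100%
figures mean every verdict matched the original spam/ham label.

```python
# Counts taken from the scoring columns of the quoted table
# (the bogofilter and randomtrain rows).
flagged_spam = 101   # messages scored as spam
flagged_good = 616   # messages scored as good
total = flagged_spam + flagged_good   # 717 messages, none of them Nigerian

# Reading A: the target is "Nigerian vs non-Nigerian".
# Every test message is non-Nigerian, so each spam verdict is an error.
errors_a = flagged_spam
accuracy_a = (total - errors_a) / total

# Reading B: the target is "spam vs ham" as originally classified.
# Assumption: all verdicts agree with the original labels (the "100%").
errors_b = 0
accuracy_b = (total - errors_b) / total

print(f"Nigerian/non-Nigerian: {errors_a} errors, accuracy {accuracy_a:.1%}")
print(f"spam/ham:              {errors_b} errors, accuracy {accuracy_b:.1%}")
```

The same output counts as perfect or as 101 errors depending
only on which criterion is applied.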

>I make no claims for the validity of the test.  It's definitely mixing 
>apples and oranges.  However, there _does_ seem to be an accuracy effect 

For an unclear definition of accuracy. Above you say you want
to distinguish Nigerian from non-Nigerian. But you evaluate
the output on a different criterion, namely spam vs. ham.

Say we have a maildrop. Two people share it, Alice and Bob.
We want to use bogofilter to split the messages. For
training we use all messages for Bob and, of those for
Alice, only the ones from Eve, who never writes to Bob. Now
what is the correct behavior of the filter? Split into
Alice/Bob or into from-Eve/not-from-Eve?
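
A toy sketch of this ambiguity in Python (the data and the two
rule functions are hypothetical, invented only for illustration;
this is not how bogofilter works internally): both rules fit the
training set perfectly, yet they disagree on a message to Alice
that is not from Eve.

```python
# Hypothetical training data: (recipient, sender) pairs from the maildrop.
# All of Bob's mail is included; of Alice's mail, only messages from Eve.
training = [
    ("bob", "carol"),
    ("bob", "dave"),
    ("bob", "frank"),
    ("alice", "eve"),
    ("alice", "eve"),
]

def by_recipient(msg):
    """Rule 1: the message belongs in Alice's folder if it is for Alice."""
    return msg[0] == "alice"

def by_sender(msg):
    """Rule 2: the message belongs in Alice's folder if it is from Eve."""
    return msg[1] == "eve"

# The two rules give identical answers on every training message ...
agree_on_training = all(by_recipient(m) == by_sender(m) for m in training)

# ... but they disagree on a message to Alice from someone other than Eve.
new_msg = ("alice", "carol")
disagree_on_new = by_recipient(new_msg) != by_sender(new_msg)

print(agree_on_training, disagree_on_new)
```

Nothing in the training set can tell the two rules apart, so
neither answer for the new message can be called wrong from the
training data alone.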

>and it _does_ seem to be related to wordlist size.

I still want to understand what randomtrain does. The readme
does not explain how random it is. Can it happen by accident
that only ham is used? When does it stop?

>> > Conclusion, while bogotrain.pl creates very small wordlists to distinguish
>> > Nigerian spam from normal ham, the small wordlists didn't have enough
>> > information for classifying other messages.
>>
>>Yes, but this is no surprise. The approach is to get an
>>exact understanding of the training set. If you hide
>>relevant information, this must fail. Elementary statistics;
>>go to a garage where the entry is only two meters high. You
>>will be surprised how many trucks you find there. Not all
>>that many, I guess;-) Looking at those statistics won't tell
>>you anything about trucks on a road. IOW: The sampling is wrong.
>
>I agree that needed information is lacking for test #3.  Given the perfect 
>results of tests 1 & 2, there doesn't seem to be a lack of information.

The question is what they find. They fail completely to
distinguish Nigerian from non-Nigerian.

>> >> > 2) I expected to see some incorrect classifications in all the tests
>> >> and am
>>
>>Do newer versions of Eudora still break lines like this?
>
>I don't know.  I assume it's normal line wrap at column 72 (or whatever).

I see: format flawed.

pi



