training to exhaustion and the risk of overvaluing irrelevant tokens

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Thu Aug 14 08:55:07 CEST 2003


Greg Louis <glouis at dynamicro.on.ca> wrote:

>In current editions of the FAQ, mention is made of the risk one takes
>in training to exhaustion (taking messages bogofilter misclassifies or
>is unsure about, and retraining with them till bogofilter gets them
>right).  If one does this, irrelevant tokens present in such messages
>acquire higher counts than they ought to, and may for a time degrade
>bogofilter's classification accuracy.

I doubt that. Any selective choice of training messages,
like any train-on-error approach, carries the risk that
the "irrelevant tokens" in the chosen messages accumulate.

But when training to exhaustion, this problem -- if it
happens -- is seen and corrected; that is the whole point
of repeatedly re-checking the messages.

So the risk you describe is actually higher with a single
training pass, because there we never see the problem and
hence cannot correct it.
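
To make that concrete, here is a rough sketch in Python
(purely illustrative; the Trainer class and its scoring are
made up and far simpler than what bogofilter really does) of
a single train-on-error pass versus training to exhaustion:

from collections import defaultdict

class Trainer:
    def __init__(self):
        # token -> [spam_count, ham_count]
        self.counts = defaultdict(lambda: [0, 0])

    def learn(self, tokens, is_spam):
        for t in tokens:
            self.counts[t][0 if is_spam else 1] += 1

    def classify(self, tokens):
        spam = sum(self.counts[t][0] for t in tokens)
        ham = sum(self.counts[t][1] for t in tokens)
        return spam > ham   # True means "spam"

def single_pass_train_on_error(trainer, corpus):
    # One pass: learn only the messages the current database
    # gets wrong.  Counts skewed here are never re-examined.
    for tokens, is_spam in corpus:
        if trainer.classify(tokens) != is_spam:
            trainer.learn(tokens, is_spam)

def train_to_exhaustion(trainer, corpus, max_passes=20):
    # Repeat until every training message is classified
    # correctly.  Each pass re-checks the whole corpus, so a
    # skew introduced by an earlier pass shows up as fresh
    # errors and gets corrected.
    for _ in range(max_passes):
        errors = 0
        for tokens, is_spam in corpus:
            if trainer.classify(tokens) != is_spam:
                trainer.learn(tokens, is_spam)
                errors += 1
        if errors == 0:
            break

The point is in the outer loop: the single pass never looks
back, while training to exhaustion keeps re-checking the very
messages whose counts it has changed.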

>Imagine bogofilter is used to recognize dogs as longhaired or
>shorthaired, and is trained with quite a variety of canines.  Now
>suppose that a short-haired dog with floppy ears gets misclassified as
>long-haired, and a long-haired dog with upright ears gets classified as
>short-haired (the hair length being at the near edge of normal for both
>animals).  Because we're training to exhaustion, we show bogofilter the
>dogs over and over till it gets them right; it takes twenty passes. 
>Guess what?  Bogofilter has learned that dogs with floppy ears are
>usually short-haired and ones with upright ears are long-haired.  

No, you said yourself that we train with a large variety.
That is of course the basic assumption; we are not training
with just a few animals. And that variety makes sure that
the wrong conclusion is seen: bogofilter will misclassify
other dogs because of the shape of their ears, and that is
recognized and corrected in later passes.

If you do not repeat the learning process, the wrong
conclusion can take hold, because it only becomes visible
late in the process, too late to be sure that enough
counterexamples are seen.
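
Continuing the sketch above with your dog example (the
corpus and token names are again made up; Trainer and
train_to_exhaustion are the toy versions from before):

# Dogs as token sets; True stands in for "long-haired",
# i.e. the "spam" side of the toy Trainer.
corpus = [
    ({"floppy_ears", "short_hair"}, False),   # borderline short-haired dog
    ({"upright_ears", "long_hair"}, True),    # borderline long-haired dog
    ({"floppy_ears", "long_hair"}, True),     # counterexamples from the
    ({"upright_ears", "short_hair"}, False),  # varied rest of the corpus
]

trainer = Trainer()
train_to_exhaustion(trainer, corpus)
print(dict(trainer.counts))
# In this run the ear tokens end up balanced or unused across
# the two classes, while the hair-length tokens carry the
# signal: repeated re-checking keeps "floppy ears" from
# becoming a proxy for "short-haired".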

>The
>next German shepherd and the next St. Bernard both get misclassified. 

You have already seen those breeds during training, so the
problem shows up and is corrected there.

Thanks for the very convincing argument that repeated
training indeed does improve accuracy. This agrees with the
test data I posted two weeks ago.

pi



