training to exhaustion and the risk of overvaluing irrelevant tokens

David Relson relson at osagesoftware.com
Thu Aug 14 14:21:23 CEST 2003


At 02:55 AM 8/14/03, Boris 'pi' Piwinger wrote:
>Greg Louis <glouis at dynamicro.on.ca> wrote:
>
> >In current editions of the FAQ, mention is made of the risk one takes
> >in training to exhaustion (taking messages bogofilter misclassifies or
> >is unsure about, and retraining with them till bogofilter gets them
> >right).  If one does this, irrelevant tokens present in such messages
> >acquire higher counts than they ought to, and may for a time degrade
> >bogofilter's classification accuracy.
>
>I doubt that. Any choice of messages to train with, like any
>train-on-error approach, carries the risk that the "irrelevant
>tokens" in the chosen messages accumulate.
>
>But when training to exhaustion is done, this problem -- if
>it happens -- is seen and corrected; that is the point of
>repeatedly checking the messages.

pi,

The present methods of "train-to-exhaustion" are simplistic.  Each message 
is scored and, if there's an error, training happens.  The process repeats 
until no errors are encountered.
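In rough pseudocode, that loop looks like the sketch below (the `score` and `train` callables are hypothetical stand-ins for bogofilter's actual classification and registration steps, not its real interface):

```python
# Minimal sketch of the simplistic train-to-exhaustion loop.
# `score(msg)` and `train(msg, is_spam)` are hypothetical stand-ins
# for bogofilter's classification and registration steps.

def train_to_exhaustion(messages, score, train):
    """Repeat full passes until one pass produces no errors.

    messages: list of (msg, is_spam) pairs.
    Note: a message misclassified on several passes is trained on
    several times -- the problem discussed above.
    """
    while True:
        errors = 0
        for msg, is_spam in messages:
            if score(msg) != is_spam:   # misclassified on this pass...
                train(msg, is_spam)     # ...so register it (again!)
                errors += 1
        if errors == 0:
            break
```

With a toy model where a message needs two registrations before it scores correctly, this loop trains on the same message twice, which is exactly how the counts drift away from a unique set of messages.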

The difficulty is that each message _can_ be misclassified more than once 
(over the course of multiple passes), and hence be trained on more than 
once.  The token and message counts then no longer reflect a unique set of 
messages, as Bayes' theorem assumes.

If the "train-to-exhaustion" method remembered which messages were used for 
training and skipped those messages on subsequent passes, the problem of 
multiple training would go away.
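That fix can be sketched as follows (again with hypothetical `score`/`train` stand-ins, not bogominitrain.pl's actual code): keep a set of already-trained message ids and skip those messages on later passes.

```python
# Sketch of train-to-exhaustion where each message is used for
# training at most once, so the counts still reflect a unique set
# of messages.  `score` and `train` are hypothetical stand-ins.

def train_to_exhaustion_once(messages, score, train):
    """Like train-to-exhaustion, but never train on a message twice.

    messages: list of (msg, is_spam) pairs.
    """
    trained = set()  # indices of messages already used for training
    while True:
        errors = 0
        for i, (msg, is_spam) in enumerate(messages):
            if i in trained:
                continue            # already contributed to the counts
            if score(msg) != is_spam:
                train(msg, is_spam)
                trained.add(i)
                errors += 1
        if errors == 0:
            break
```

Note one trade-off this makes visible: a skipped message that is still misclassified no longer counts as an error, so the loop can terminate with some messages uncorrected -- the price of keeping each message's contribution unique.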

As a test, modify bogominitrain.pl so it keeps track of messages used for 
training and skips them on subsequent passes.  Then see what happens with 
the modified script.

David

More information about the Bogofilter mailing list