training to exhaustion and the risk of overvaluing irrelevant tokens

David Relson relson at osagesoftware.com
Thu Aug 14 14:21:23 CEST 2003


At 02:55 AM 8/14/03, Boris 'pi' Piwinger wrote:
>Greg Louis <glouis at dynamicro.on.ca> wrote:
>
> >In current editions of the FAQ, mention is made of the risk one takes
> >in training to exhaustion (taking messages bogofilter misclassifies or
> >is unsure about, and retraining with them till bogofilter gets them
> >right).  If one does this, irrelevant tokens present in such messages
> >acquire higher counts than they ought to, and may for a time degrade
> >bogofilter's classification accuracy.
>
>I doubt that. Any choice of messages to train with, like any
>train-on-error approach, carries the risk that the "irrelevant
>tokens" in the chosen messages accumulate.
>
>But when training to exhaustion is done, this problem -- if
>it happens -- is seen and corrected; that is the point of
>repeatedly checking the messages.

pi,

The present methods of "train-to-exhaustion" are simplistic.  Each message 
is scored and, if there's an error, training happens.  The process repeats 
until no errors are encountered.
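In rough pseudocode, that loop looks like the sketch below (the `score` and `train` callables are hypothetical stand-ins for bogofilter's actual classification and registration steps, not its real interface):

```python
# Minimal sketch of the simplistic train-to-exhaustion loop.
# `score(msg)` and `train(msg, is_spam)` are hypothetical stand-ins
# for bogofilter's classification and registration steps.

def train_to_exhaustion(messages, score, train):
    """Repeat full passes until one pass produces no errors.

    messages: list of (msg, is_spam) pairs.
    Note: a message misclassified on several passes is trained on
    several times -- the problem discussed above.
    """
    while True:
        errors = 0
        for msg, is_spam in messages:
            if score(msg) != is_spam:   # misclassified on this pass...
                train(msg, is_spam)     # ...so register it (again!)
                errors += 1
        if errors == 0:
            break
```

With a toy model where a message needs two registrations before it scores correctly, this loop trains on the same message twice, which is exactly how the counts drift away from a unique set of messages.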

The difficulty is that each message _can_ be misclassified more than once 
(over the course of multiple passes), and hence be trained on more than 
once.  The token and message counts then no longer reflect a unique set of 
messages, as Bayes' theorem assumes.

If the "train-to-exhaustion" method remembered which messages were used for 
training and skipped those messages on subsequent passes, the problem of 
multiple training would go away.
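That fix can be sketched as follows (again with hypothetical `score`/`train` stand-ins, not bogominitrain.pl's actual code): keep a set of already-trained message ids and skip those messages on later passes.

```python
# Sketch of train-to-exhaustion where each message is used for
# training at most once, so the counts still reflect a unique set
# of messages.  `score` and `train` are hypothetical stand-ins.

def train_to_exhaustion_once(messages, score, train):
    """Like train-to-exhaustion, but never train on a message twice.

    messages: list of (msg, is_spam) pairs.
    """
    trained = set()  # indices of messages already used for training
    while True:
        errors = 0
        for i, (msg, is_spam) in enumerate(messages):
            if i in trained:
                continue            # already contributed to the counts
            if score(msg) != is_spam:
                train(msg, is_spam)
                trained.add(i)
                errors += 1
        if errors == 0:
            break
```

Note one trade-off this makes visible: a skipped message that is still misclassified no longer counts as an error, so the loop can terminate with some messages uncorrected -- the price of keeping each message's contribution unique.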

As a test, modify bogominitrain.pl so it keeps track of messages used for 
training and skips them on subsequent passes.  Then see what happens with 
the modified script.

David

More information about the Bogofilter mailing list