training to exhaustion and the risk of overvaluing irrelevant tokens

David Relson relson at osagesoftware.com
Thu Aug 14 14:54:01 CEST 2003


At 08:41 AM 8/14/03, Boris 'pi' Piwinger wrote:
>David Relson wrote:
>
> >> >In current editions of the FAQ, mention is made of the risk one takes
> >> >in training to exhaustion (taking messages bogofilter misclassifies or
> >> >is unsure about, and retraining with them till bogofilter gets them
> >> >right).  If one does this, irrelevant tokens present in such messages
> >> >acquire higher counts than they ought to, and may for a time degrade
> >> >bogofilter's classification accuracy.
> >>
> >>I doubt that.  Any choice of messages to train with, like any
> >>train-on-error approach, carries the risk that those "irrelevant
> >>tokens" in the chosen messages accumulate.
> >>
> >>But when training to exhaustion is done, this problem -- if
> >>it happens -- is seen and corrected; that is the point of
> >>repeatedly re-checking the messages.
> >
> > The present methods of "train-to-exhaustion" are simplistic.  Each message
> > is scored and, if there's an error, training happens.  The process repeats
> > until no errors are encountered.
>
>Right.
>
> > The difficulty is that each message _can_ err more than once (in the
> > course of multiple passes), hence be trained on more than once.  So the
> > token and message counts no longer reflect a unique set of messages, as
> > the Bayesian theorem assumes.
>
>This is correct, that can happen.
>
> > If the "train-to-exhaustion" method remembered which messages were used
> > for training and skipped those messages on subsequent passes, the problem
> > of multiple training would go away.
>
>This could easily be implemented.  The drawback is that it might leave
>one error permanently uncorrected, making it hard to tell when to stop.
>Well, I could add the condition that the loop ends once a full pass adds
>no message.

That would be the proper way to do it.
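
For concreteness, here is a minimal sketch (in Python, not bogofilter's
actual code) of an exhaustion loop with both refinements: it remembers
which messages it has already trained on, and it stops once a full pass
adds no new message.  The score() and train() hooks and the 0.1/0.9
cutoffs are stand-ins for whatever classifier and thresholds are in use.

    # Hypothetical interface: score(msg) -> spamicity in [0, 1];
    # train(msg, is_spam) adds the message's tokens to the word lists.
    def train_to_exhaustion(messages, score, train,
                            ham_cutoff=0.1, spam_cutoff=0.9):
        """messages: a list of (msg, is_spam) pairs."""
        trained = set()            # indices of messages already trained on
        while True:
            added = False
            for i, (msg, is_spam) in enumerate(messages):
                if i in trained:
                    continue       # never train on the same message twice
                s = score(msg)
                misclassified = (is_spam and s < spam_cutoff) or \
                                (not is_spam and s > ham_cutoff)
                if misclassified:
                    train(msg, is_spam)
                    trained.add(i)
                    added = True
            if not added:          # a full pass added nothing: stop, even
                break              # if a skipped message still scores wrong
        return trained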

>But the problem described above is independent of this
>double-training.  It (overvaluing irrelevant tokens) can
>happen with or without double training, simply through the choice
>of messages to train with.  But as I argue above, fewer problems
>are to be expected with repeated training than with just one run.

In practical terms, training to exhaustion may indeed work well.  As Greg
points out, though, it violates a basic assumption on which the Bayesian
theorem is based, and he also notes how training to exhaustion _can_ lead
to trouble.
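
To make the double-counting effect concrete, here is a small invented
example.  It assumes the Graham-style raw token estimate computed from
the word-list counts, p = (b/nbad) / (b/nbad + g/ngood); the counts are
made up, and a real filter applies further smoothing on top of this.

    # Invented counts: an "irrelevant" token seen twice in ham and once in
    # a spam message that the exhaustion loop happens to train on twice.
    def spamicity(b, nbad, g, ngood):
        pb, pg = b / nbad, g / ngood   # token frequency in spam vs. ham
        return pb / (pb + pg)

    g, ngood = 2, 100                    # ham-side counts stay fixed
    print(spamicity(1, 100, g, ngood))   # trained once:  ~0.33
    print(spamicity(2, 101, g, ngood))   # trained twice: token and message
                                         # counts both rise, yet only one
                                         # real message exists -> ~0.50

The shift is small for any one token, but it applies to every irrelevant
token the re-trained message contains, which is exactly the degradation
the FAQ passage quoted above warns about.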
