training to exhaustion and the risk of overvaluing irrelevant tokens

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Thu Aug 14 14:41:44 CEST 2003


David Relson wrote:

>> >In current editions of the FAQ, mention is made of the risk one takes
>> >in training to exhaustion (taking messages bogofilter misclassifies or
>> >is unsure about, and retraining with them till bogofilter gets them
>> >right).  If one does this, irrelevant tokens present in such messages
>> >acquire higher counts than they ought to, and may for a time degrade
>> >bogofilter's classification accuracy.
>>
>>I doubt that. Any choice of messages to train with, like any
>>train-on-error approach, carries the risk that the "irrelevant
>>tokens" in the chosen messages accumulate.
>>
>>But when training to exhaustion is done, this problem -- if
>>it happens -- is seen and corrected; that is the point of
>>repeatedly rechecking the messages.
> 
> The present methods of "train-to-exhaustion" are simplistic.  Each message 
> is scored and, if there's an error, training happens.  The process repeats 
> until no errors are encountered.

Right.
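
For reference, a minimal sketch of such a loop, assuming messages are
stored one per file in spam/ and ham/ directories and that bogofilter
is on the PATH (the directory layout and the helpers classify() and
train() are my own, not part of bogofilter):

    #!/usr/bin/env python3
    # Naive train-to-exhaustion: rescore everything, train on every
    # error, repeat until a full pass produces no errors at all.
    import subprocess
    from pathlib import Path

    SPAM_DIR, HAM_DIR = Path("spam"), Path("ham")

    def classify(msg: Path) -> int:
        # bogofilter exit codes: 0 = spam, 1 = ham, 2 = unsure
        with msg.open("rb") as fh:
            return subprocess.run(["bogofilter"], stdin=fh).returncode

    def train(msg: Path, as_spam: bool) -> None:
        flag = "-s" if as_spam else "-n"   # register as spam resp. ham
        with msg.open("rb") as fh:
            subprocess.run(["bogofilter", flag], stdin=fh, check=True)

    while True:
        errors = 0
        for msg in SPAM_DIR.iterdir():
            if classify(msg) != 0:         # misclassified or unsure
                train(msg, as_spam=True)
                errors += 1
        for msg in HAM_DIR.iterdir():
            if classify(msg) != 1:
                train(msg, as_spam=False)
                errors += 1
        if errors == 0:                    # clean pass: stop
            break

Note that the same message can be trained in several passes, which is
exactly the double counting you describe below.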

> The difficulty is that each message _can_ err more than once (in the course 
> of multiple passes), hence be trained on more than once.  So the token and 
> message counts no longer reflect a unique set of messages, as Bayes' 
> theorem assumes.

This is correct, that can happen.
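
A small numeric illustration, with invented counts and a simplified
frequency-ratio estimate rather than bogofilter's actual calculation:

    # A token that occurs in 1 of 100 spam and 1 of 100 ham messages
    # looks neutral; if its one spam message is retrained twice more
    # during exhaustion, the same token suddenly looks spammy.
    def p_spam(spam_hits, ham_hits, nspam, nham):
        bad, good = spam_hits / nspam, ham_hits / nham
        return bad / (bad + good)

    print(p_spam(1, 1, 100, 100))   # 0.50  -- trained once, neutral
    print(p_spam(3, 1, 102, 100))   # ~0.75 -- same message trained 3 times

The counts now pretend that three distinct spam messages contained the
token, although only one did.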

> If the "train-to-exhaustion" method remembered which messages were used for 
> training and skipped those messages on subsequent passes, the problem of 
> multiple training would go away.

This could easily be implemented. The drawback is that it might
leave one persistent error behind, so it becomes hard to tell when
to stop. I could, however, add the condition that the loop ends
once a pass adds no message.
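
A sketch of that variant, reusing classify() and train() from the
sketch above (again my own helpers, not bogofilter code): it skips
messages that were already trained on and stops as soon as a full
pass adds nothing.

    # Remember which messages were trained; stop when a pass adds none,
    # even if some messages still score wrong.
    trained = set()

    while True:
        added = 0
        for directory, is_spam in ((SPAM_DIR, True), (HAM_DIR, False)):
            for msg in directory.iterdir():
                if msg in trained:
                    continue                    # never train a message twice
                expected = 0 if is_spam else 1  # expected exit code
                if classify(msg) != expected:   # misclassified or unsure
                    train(msg, as_spam=is_spam)
                    trained.add(msg)
                    added += 1
        if added == 0:                          # "no message was added"
            break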

But the problem described above is independent of this
double training. Overvaluing irrelevant tokens can happen with or
without double training, simply through the choice of messages to
train with. As I argue above, though, fewer problems are to be
expected with the repeated checking than with a single run.

pi