training to exhaustion and the risk of overvaluing irrelevant tokens

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Thu Aug 14 15:19:57 CEST 2003


David Relson wrote:

>> > If the "train-to-exhaustion" method remembered which messages were used for
>> > training and skipped those messages on subsequent passes, the problem of
>> > multiple training would go away.
>>
>>This could easily be implemented. The downside is that it
>>might leave one error uncorrected, making it hard to tell
>>when to stop. Well, I could add the condition that no
>>message was added.
> 
> That would be the proper way to do it.

I'll send you a version later today to be included in
0.14.5, in addition to the lexer patch and the -T option. Is
there more to come?
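
As a minimal sketch of the loop in question, in Python for
readability: the names (score_message, train_message) and the
0.5 cutoff are placeholders of mine, not bogofilter's actual
interface. The two fixes discussed above are skipping messages
already used for training and stopping once a full pass adds
no message:

  def train_to_exhaustion(messages, is_spam,
                          score_message, train_message):
      trained = set()  # ids of messages already used for training
      while True:
          added = False
          for msg_id, text in messages.items():
              if msg_id in trained:
                  continue  # never train on the same message twice
              scored_as_spam = score_message(text) > 0.5
              if scored_as_spam != is_spam[msg_id]:
                  train_message(text, as_spam=is_spam[msg_id])
                  trained.add(msg_id)
                  added = True
          if not added:
              break  # no message was added this pass, so stop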

>>But the problem described above is independent of this
>>double-training. It (overvaluing irrelevant tokens) can
>>happen with or without double training, just by the choice
>>of messages to train with. But as I argue above, fewer
>>problems are to be expected from the repeated training than
>>from a single run.
> 
> In practical terms, training to exhaustion may, indeed, work well.  As Greg 
> points out, it violates a basic assumption on which Bayes' theorem is 
> based.

Right, but as I said, even a single repetition, i.e.,
train-on-error in its simplest form, violates that as well.
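
To see how repeated training can overvalue an irrelevant
token, here is a small numeric sketch. The counts are invented
and the formula bad/(bad+good) is a simplification of what
bogofilter actually computes:

  def spamicity(bad, good):
      # Simple count ratio, for illustration only; bogofilter's
      # real scoring (Robinson's method) is more involved.
      return bad / (bad + good)

  # A neutral token seen equally often in spam and ham:
  print(spamicity(bad=5, good=5))   # 0.50 -- neutral

  # If one stubborn spam containing that token is trained three
  # times before it finally scores as spam, the token's bad
  # count grows by 3 instead of 1 (invented numbers):
  print(spamicity(bad=8, good=5))   # ~0.62 -- now looks spammy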

> He also points out how training to exhaustion _can_ lead to trouble.

That argument is also true for train-on-error.

pi