training to exhaustion and the risk of overvaluing irrelevant tokens
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Thu Aug 14 15:19:57 CEST 2003
David Relson wrote:
>> > If the "train-to-exhaustion" method remembered which messages were used for
>> > training and skipped those messages on subsequent passes, the problem of
>> > multiple training would go away.
>>
>>This could easily be implemented. The bad thing is that this
>>might just keep one error left so it is hard to tell when to
>>stop. Well, I could add the condition that no message was added.
>
> That would be the proper way to do it.
I'll send you a version later today to be included in
0.14.5, in addition to the lexer patch and the -T option. Is
there more to come?
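The loop described above (skip messages already used for training, stop once a full pass adds nothing) can be sketched roughly as follows. This is an illustrative sketch only, not bogofilter's actual code; the `classify` and `train` callables are hypothetical stand-ins.

```python
def train_to_exhaustion(messages, classify, train, max_passes=10):
    """Repeatedly train on misclassified messages, skipping messages
    already used for training, until a pass adds no new message.

    messages: iterable of (msg_id, text, is_spam) tuples.
    classify(text) -> bool and train(text, is_spam) are assumed helpers.
    """
    trained = set()  # ids of messages already used for training
    for _ in range(max_passes):
        added = 0
        for msg_id, text, is_spam in messages:
            if msg_id in trained:
                continue  # never train on the same message twice
            if classify(text) != is_spam:
                train(text, is_spam)
                trained.add(msg_id)
                added += 1
        if added == 0:
            break  # stop condition: no message was added this pass
    return trained
```

Note that without the `trained` set, a single stubbornly misclassified message would be fed in on every pass, which is exactly the repeated-training problem being discussed.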
>>But the problem described above is independent of this
>>double-training. It (overvaluing irrelevant tokens) can
>>happen with or without double training, just by the choice
>>of messages to train with. But as I argue above, fewer
>>problems are to be expected from the repeated training as
>>compared to just one run.
>
> In practical terms, training to exhaustion may, indeed, work well. As Greg
> points out, it violates a basic assumption on which the Bayesian theorem is
> based.
Right, but as I said, even a single repetition, i.e., train on
error in its simplest form, violates that assumption as well.
> He also points out how training to exhaustion _can_ lead to trouble.
That argument is also true for train on error.
pi