exhaustion
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Tue Feb 10 15:22:00 CET 2004
Greg Louis wrote:
>> I should probably propose an experimental design. I would suggest the
>> following: take 40,000 spam and 40,000 nonspam; use 13,000 of each for
>> training. Train once with the 13,000 of each, then bogotune with the
>> training db and the remaining 27,000. Store final counts of false
>> positives and false negatives. Train again with the 13,000 of each and
>> bogotune again; also classify the 13,000 separately. Repeat this
>> process until, after training, the 13,000 show no fp or fn when
>> classified. The repeated bogotuning is required because optimal
>> parameter values will change as the training progresses.
>>
>> Anybody got a better idea?
>
> Yeah, I do. In the second and subsequent training rounds, I should
> only train with errors and unsures from the original 13,000 (always
> reclassifying all 13,000 though).
Much better. But why not do it right from the beginning?
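
For concreteness, here is a rough sketch of such a train-on-error round
in Python. classify() and train() are only placeholders for however you
drive bogofilter (piping messages through it and registering them as
spam or ham); the round cap is my own assumption, not part of Greg's
proposal.

    def classify(msg):
        """Placeholder: return 'spam', 'ham' or 'unsure' for msg."""
        raise NotImplementedError

    def train(msg, label):
        """Placeholder: register msg as 'spam' or 'ham' in the wordlist."""
        raise NotImplementedError

    def train_on_error(corpus, max_rounds=20):
        """corpus: list of (msg, true_label) pairs, true_label in {'spam', 'ham'}.

        Each round reclassifies the whole corpus, then retrains only on
        the errors and unsures.  Stops as soon as a round is clean, or
        after max_rounds if training never converges.
        """
        for round_no in range(1, max_rounds + 1):
            mistakes = [(msg, label) for msg, label in corpus
                        if classify(msg) != label]   # errors and unsures
            if not mistakes:
                return round_no                      # exhausted: no fp, fn, unsure
            for msg, label in mistakes:
                train(msg, label)
        return None                                  # did not converge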
> This can lead to failure to converge
> (the simplest case of which is that training with message A causes
> message B to become wrongly classified and vice versa)
Right, you can easily construct that case. But it requires the
messages to be almost identical. I haven't observed it in
practice, though.
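
If you want to catch that oscillation explicitly instead of just
capping the number of rounds, a cheap heuristic (again only a sketch,
not anything bogofilter ships) is to remember which error sets you
have already retrained on and stop when one recurs:

    def train_to_exhaustion(corpus, max_rounds=20):
        """Same loop as above, but stops early if the identical set of
        misclassified messages shows up twice -- the simplest sign that
        training with A breaks B and vice versa."""
        seen = set()
        for _ in range(max_rounds):
            mistakes = [(i, label) for i, (msg, label) in enumerate(corpus)
                        if classify(msg) != label]
            if not mistakes:
                return True                          # converged
            key = frozenset(i for i, _ in mistakes)
            if key in seen:
                return False                         # same error set again: oscillating
            seen.add(key)
            for i, label in mistakes:
                train(corpus[i][0], label)
        return False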
pi