exhaustion

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Feb 10 15:22:00 CET 2004


Greg Louis wrote:

>> I should probably propose an experimental design.  I would suggest the
>> following: take 40,000 spam and 40,000 nonspam; use 13,000 of each for
>> training.  Train once with the 13,000 of each, then bogotune with the
>> training db and the remaining 27,000.  Store final counts of false
>> positives and false negatives.  Train again with the 13,000 of each and
>> bogotune again; also classify the 13,000 separately.  Repeat this
>> process until, after training, the 13,000 show no fp or fn when
>> classified.  The repeated bogotuning is required because optimal
>> parameter values will change as the training progresses.
>> 
>> Anybody got a better idea?
> 
> Yeah, I do.  In the second and subsequent training rounds, I should
> only train with errors and unsures from the original 13,000 (always
> reclassifying all 13,000 though). 

Much better. But why not do it right from the beginning?
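For concreteness, the train-on-error loop Greg describes can be sketched in a few lines. This is a minimal illustration only: a toy per-word-weight classifier stands in for bogofilter (which is driven through its command line and uses Robinson/Fisher statistics, not this scoring), and `train`, `classify`, and the corpus are hypothetical.

```python
# Toy stand-in for bogofilter's scoring: per-word weights, positive = spammy.
# (Hypothetical; real bogofilter scoring works quite differently.)
weights = {}

def classify(msg):
    score = sum(weights.get(w, 0) for w in msg.split())
    return "spam" if score > 0 else "ham"

def train(msg, label):
    # Shift each word's weight toward the supplied label.
    delta = 1 if label == "spam" else -1
    for w in msg.split():
        weights[w] = weights.get(w, 0) + delta

def train_on_error(corpus, max_rounds=50):
    """Reclassify the whole corpus each round, but retrain only on the
    errors; stop when a full pass is clean.  Returns the number of
    rounds used, or None if the round limit is hit (no convergence)."""
    for rounds in range(1, max_rounds + 1):
        errors = [(m, lbl) for m, lbl in corpus if classify(m) != lbl]
        if not errors:
            return rounds
        for m, lbl in errors:
            train(m, lbl)
    return None

corpus = [("viagra cheap", "spam"), ("meeting notes", "ham")]
rounds = train_on_error(corpus)
```

In the real experiment each round would also rerun bogotune, since the optimal parameters shift as the training database grows.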

> This can lead to failure to converge
> (the simplest case of which is that training with message A causes
> message B to become wrongly classified and vice versa)

Right, you can easily construct that case, but it requires the
messages to be nearly identical. I haven't observed it in
practice, though.
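The oscillating case is indeed easy to manufacture: two identical texts with opposite labels make each corrective round undo the previous one. Here is a self-contained sketch (again a hypothetical toy weight classifier, not bogofilter's scoring) that flags non-convergence by remembering which error sets it has already seen. This is a heuristic, since a repeated error set need not imply identical internal state, but it catches the simple A/B flip:

```python
def converges(corpus, max_rounds=20):
    """Run train-on-error with a toy word-weight classifier; return True
    on a clean pass, False if the set of misclassified messages repeats
    (the A-reclassifies-B-and-vice-versa cycle) or the round limit hits."""
    weights = {}

    def classify(msg):
        return "spam" if sum(weights.get(w, 0) for w in msg.split()) > 0 else "ham"

    seen = set()
    for _ in range(max_rounds):
        errors = tuple(i for i, (m, lbl) in enumerate(corpus)
                       if classify(m) != lbl)
        if not errors:
            return True
        if errors in seen:   # same error set as an earlier round: cycling
            return False
        seen.add(errors)
        for i in errors:
            m, lbl = corpus[i]
            for w in m.split():
                weights[w] = weights.get(w, 0) + (1 if lbl == "spam" else -1)
    return False
```

Two distinct messages converge; the same text labeled both spam and ham flips back and forth forever, exactly the failure mode described.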

pi



