exhaustion

Greg Louis glouis at dynamicro.on.ca
Tue Feb 10 15:14:51 CET 2004


On 20040210 (Tue) at 0846:35 -0500, Greg Louis wrote:

> I should probably propose an experimental design.  I would suggest the
> following: take 40,000 spam and 40,000 nonspam; use 13,000 of each for
> training.  Train once with the 13,000 of each, then bogotune with the
> training db and the remaining 27,000.  Store final counts of false
> positives and false negatives.  Train again with the 13,000 of each and
> bogotune again; also classify the 13,000 separately.  Repeat this
> process until, after training, the 13,000 show no fp or fn when
> classified.  The repeated bogotuning is required because optimal
> parameter values will change as the training progresses.
> 
> Anybody got a better idea?

Yeah, I do.  In the second and subsequent training rounds, I should
train only with the errors and unsures from the original 13,000 (while
still reclassifying all 13,000 each round).  This can fail to converge
-- the simplest case being that training with message A causes message
B to become wrongly classified, and vice versa -- and I saw that a few
times when I was testing exhaustion before.  So I should set an epsilon
of four or so: four or fewer classification errors, not zero, should be
the criterion for stopping.
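The loop sketched above (train on the full set once, then retrain only on
misclassified messages until at most epsilon errors remain) might look
roughly like this.  This is a toy stand-in, not bogofilter or bogotune:
ToyFilter and its word-count scoring are invented purely for illustration,
and epsilon=4 mirrors the stopping criterion proposed above.

```python
from collections import Counter

class ToyFilter:
    """Toy word-count classifier standing in for bogofilter (illustration only)."""
    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, msgs, labels):
        for msg, label in zip(msgs, labels):
            (self.spam if label == "spam" else self.ham).update(msg.split())

    def classify(self, msg):
        spam_score = sum(self.spam[w] for w in msg.split())
        ham_score = sum(self.ham[w] for w in msg.split())
        return "spam" if spam_score > ham_score else "ham"

def exhaustion_train(train, classify, msgs, labels, epsilon=4, max_rounds=50):
    """Train once on everything, then retrain only on misclassified
    messages, reclassifying the whole set each round.  Stop when the
    error count drops to epsilon or fewer; max_rounds guards against
    the non-convergent A/B oscillation mentioned in the post."""
    train(msgs, labels)                          # round 1: full training set
    for round_no in range(1, max_rounds + 1):
        errors = [(m, y) for m, y in zip(msgs, labels) if classify(m) != y]
        if len(errors) <= epsilon:
            return round_no, len(errors)         # converged
        err_msgs, err_labels = zip(*errors)
        train(list(err_msgs), list(err_labels))  # retrain on errors only
    return max_rounds, len(errors)               # gave up: failed to converge

# Tiny example run (four messages standing in for the 13,000 of each)
f = ToyFilter()
msgs = ["buy cheap pills now", "cheap pills offer",
        "meeting at noon", "lunch at noon today"]
labels = ["spam", "spam", "ham", "ham"]
rounds, remaining = exhaustion_train(f.train, f.classify, msgs, labels, epsilon=0)
```

With epsilon=0 this loop can oscillate forever on a contradictory A/B
pair, which is exactly why the post proposes an epsilon of four or so;
the max_rounds cap is an extra safety net added here.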

Comments still welcome...
-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |

More information about the Bogofilter mailing list