exhaustion, was Re: Is bogofilter Bayesian?
glouis at dynamicro.on.ca
Tue Feb 10 08:46:35 EST 2004
Just had a message from Gary:
>> In his opinion the choice of messages for
>> training on error also does no harm to this concept, hence
>> the warning would be inappropriate.
> I'd be surprised if he were to confirm that your interpretation of
> his opinion is accurate here.
"I'm not in a position to state that choosing the messages would
definitely do no harm"
We all seem to agree that testing is the only way to be sure. Much as
I dislike the idea of spending more time on this, I begin to think I
should assemble a decent-sized corpus and check out training to
exhaustion once more, especially as pi keeps reminding us he's never
seen any deleterious effect. I don't know how hard he's looked.
I should probably propose an experimental design. I would suggest the
following: take 40,000 spam and 40,000 nonspam; use 13,000 of each for
training. Train once with the 13,000 of each, then bogotune with the
training db and the remaining 27,000 of each. Record the final counts of
false positives and false negatives. Train again with the 13,000 of each
and bogotune again; also classify the 13,000 of each separately. Repeat
this process until, after training, the training messages show no fp or
fn when classified. The repeated bogotuning is required because the
optimal parameter values will change as the training progresses.
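
For anyone who wants to script this, the loop might be sketched roughly as
below. This is only an illustration of the control flow, not a tested
harness: the three callables (train_pass, tune, classify_training) are
hypothetical stand-ins for whatever actually invokes bogofilter and
bogotune, and the max_rounds cap is my own assumption so that a run which
never reaches zero errors still stops. For simplicity the sketch classifies
the training messages after every pass, including the first.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Counts = Tuple[int, int]  # (false positives, false negatives)


@dataclass
class RoundResult:
    round_no: int
    heldout_fp: int  # errors on the held-out messages after tuning
    heldout_fn: int
    train_fp: int    # errors when the training messages are classified
    train_fn: int


def run_experiment(
    train_pass: Callable[[], None],
    tune: Callable[[], Counts],
    classify_training: Callable[[], Counts],
    max_rounds: int = 50,  # assumed safety cap, not part of the design
) -> List[RoundResult]:
    """Repeat train -> tune -> classify until the training set is error-free."""
    results: List[RoundResult] = []
    for round_no in range(1, max_rounds + 1):
        train_pass()                              # train with the training messages
        heldout_fp, heldout_fn = tune()           # re-tune, score the held-out set
        train_fp, train_fn = classify_training()  # score the training set itself
        results.append(
            RoundResult(round_no, heldout_fp, heldout_fn, train_fp, train_fn)
        )
        if train_fp == 0 and train_fn == 0:
            break                                 # training messages all classify correctly
    return results


if __name__ == "__main__":
    # Toy stand-ins with made-up numbers, only to show the calling convention.
    passes = {"n": 0}
    made_up_training_errors = [(4, 1), (1, 0), (0, 0)]

    def toy_train() -> None:
        passes["n"] += 1

    def toy_tune() -> Counts:
        return (2, 3)

    def toy_classify() -> Counts:
        return made_up_training_errors[passes["n"] - 1]

    for row in run_experiment(toy_train, toy_tune, toy_classify):
        print(row)
```

Keeping the per-round held-out counts in a list is the point of the
exercise: if training to exhaustion does harm, it should show up as the
held-out fp/fn counts getting worse in later rounds.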
Anybody got a better idea?
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |