exhaustion, was Re: Is bogofilter Bayesian?

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Feb 10 15:20:17 CET 2004


Greg Louis wrote:

> "I'm not in a position to state that choosing the messages would
> definitely do no harm"
> 
> We all seem to agree that testing is the only way to be sure.  Much as
> I dislike the idea of spending more time on this, I begin to think I
> should assemble a decent-sized corpus and check out training to
> exhaustion once more.  Especially as pi keeps reminding us he's never
> seen any deleterious effect.  I don't know how hard he's looked.

I have sent a lot of test results to this list (links on
http://piology.org/bogofilter/), and I use it in production.
I don't see those bad effects. Nor do I see some of the bad
effects others report with normal training (like the
random-word discussion). That's all I can offer.

> I should probably propose an experimental design.  I would suggest the
> following: take 40,000 spam and 40,000 nonspam; use 13,000 of each for
> training.  Train once with the 13,000 of each, then bogotune with the
> training db and the remaining 27,000.  Store final counts of false
> positives and false negatives. 

This looks like you want to choose your parameters based on
the very messages you later use to evaluate the effectiveness
of your training. That seems unrealistic: parameters tuned on
the evaluation messages will score better on them than on
unseen mail.
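One way to sidestep that, as a sketch (this three-way split is my
illustration, not something Greg proposed): carve out a separate tuning
set for bogotune, and report the final fp/fn counts only on messages
that were neither trained nor tuned on. The helper below is
hypothetical; the set sizes are up to the experimenter.

```python
import random

def split_corpus(messages, n_train, n_tune, seed=0):
    """Split a corpus into three disjoint sets: train (fed to
    bogofilter), tune (fed to bogotune for parameter selection),
    and a held-out set used only for the final fp/fn counts.
    Hypothetical illustration, not part of Greg's proposal."""
    rng = random.Random(seed)      # fixed seed -> reproducible splits
    shuffled = list(messages)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    held_out = shuffled[n_train + n_tune:]
    return train, tune, held_out
```

With 40,000 messages per class one could, for example, train on 13,000,
tune on 13,500, and keep the remaining 13,500 untouched for scoring.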

> Train again with the 13,000 of each and
> bogotune again; also classify the 13,000 separately.  Repeat this
> process until, after training, the 13,000 show no fp or fn when
> classified. 

I actually doubt this converges.

> The repeated bogotuning is required because optimal
> parameter values will change as the training progresses.

That is true.

> Anybody got a better idea?

Actually, this test is dramatically different from my
approach. You basically do repeated full training.
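For contrast, train-on-error to exhaustion -- repeatedly classifying
the training set and feeding back only the misclassified messages
until a pass produces no errors -- can be sketched as below.
`ToyFilter` is a hypothetical word-counting stand-in, not bogofilter's
actual scoring.

```python
class ToyFilter:
    """Hypothetical word-counting classifier standing in for bogofilter."""

    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}

    def train(self, words, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for w in words:
            counts[w] = counts.get(w, 0) + 1

    def is_spam(self, words):
        # Crude vote: a message is spam if its words were seen more
        # often in spam training than in ham training (ties -> ham).
        spam_votes = sum(self.spam_counts.get(w, 0) for w in words)
        ham_votes = sum(self.ham_counts.get(w, 0) for w in words)
        return spam_votes > ham_votes


def train_to_exhaustion(messages, max_rounds=100):
    """Train only on misclassified messages, round after round, until
    a pass over the corpus yields no false positives or negatives.
    messages: list of (word_list, is_spam) pairs.  Returns the filter
    and the number of training rounds used (max_rounds means
    exhaustion was never reached -- the non-convergence case)."""
    f = ToyFilter()
    rounds = 0
    while rounds < max_rounds:
        errors = [(w, s) for w, s in messages if f.is_spam(w) != s]
        if not errors:      # exhaustion: corpus classified perfectly
            break
        for words, is_spam in errors:
            f.train(words, is_spam)
        rounds += 1
    return f, rounds
```

Note that with repeated *full* training instead (retraining the whole
set each round, as in Greg's design), a counting scheme like this toy
merely scales every count uniformly and never changes a decision --
one way such a loop can fail to converge.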

Main question: What exactly do you want to observe?

pi



