chi-combining

Greg Louis glouis at dynamicro.on.ca
Tue Nov 19 13:55:48 CET 2002


On 20021118 (Mon) at 23:44:07 -0500, Tim Peters wrote:
> [Greg Louis]
> > ...
> > Given that we're applying the calculations to the same individual
> > probability values, I find the congruency of the two calculation
> > methods somewhat reassuring.  The next step is to try to understand
> > whether the spambayes folks get better _binary_ results with
> > chi-squared, or whether the better results are attributable to
> > chi-squared facilitating the identification of problem cases.
> 
> I expect binary results to be the same, usually, provided you're omniscient
> enough to pick the geomean spam cutoff value to 3 decimal digits of
> precision in advance(!).

Yep.

> At least one of our testers gets much better results with chi even
> with perfect geomean hindsight, though, and that isn't understood.

I saw a report yesterday from one tester who claims much better results
with original Graham, and that isn't understood either :)
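
For anyone on the list who hasn't followed the spambayes work:
"chi-combining" means combining the individual word probabilities
with Fisher's chi-squared method instead of a geometric mean.  A
rough Python sketch of the idea as I understand it -- chi2Q here is
the standard chi-squared survival function for even degrees of
freedom, and the word probabilities are assumed to lie strictly
between 0 and 1:

    import math

    def chi2Q(x2, v):
        # P(chisq >= x2) with v degrees of freedom; v must be even.
        m = x2 / 2.0
        s = term = math.exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            s += term
        return min(s, 1.0)

    def chi_combine(probs):
        # Fisher's method, applied twice.  Treating (1-p) as p-values
        # under a "this is ham" null hypothesis, S near 1 means that
        # hypothesis is rejected (spammy); H is the mirror image.
        n = len(probs)
        S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs),
                        2 * n)
        H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
        # Near 1 for spam, near 0 for ham, near 0.5 when the evidence
        # is weak or contradictory.
        return (S - H + 1.0) / 2.0

Scores near 0.5 are the "unsure" middle ground, which is what makes
the problem cases so much easier to spot.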

> Our testing framework keeps track of FP, FN, unsures, and various drivers
> exercise it.  The driver most people use does 10-fold cross-validation.
> That means the total ham and spam get split up at random into 10
> ham/spam set pairs; then 10 runs are done, each training on 9 of the
> pairs and predicting against the remaining pair.  The output from
> that is voluminous.  rates.py
> produces summary statistics from the output, cmp.py compares two summaries
> side-by-side (usually "before" and "after" some change), and table.py
> produces *really* brief summaries from any number of output files.  The
> kinds of stats produced are raw counts, error rates, ham and spam score min,
> max, median, mean and standard deviation, percentile points beyond just the
> median, a measure of ham-vs-spam separation based on their respective score
> means and sdevs, score histograms, and automated analysis deducing the best
> possible cutoff values based on minimizing a 3-term linear cost function.
> Etc <wink>.

Nice!  Thanks for the detail!
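
Since that last item is easy to picture in code, here's a toy version
of the cutoff search (the weights and the grid are purely
illustrative, not spambayes's actual defaults):

    def classify(s, ham_cut, spam_cut):
        # Score >= spam_cut -> spam; <= ham_cut -> ham; else unsure.
        if s >= spam_cut:
            return 'spam'
        return 'ham' if s <= ham_cut else 'unsure'

    def best_cutoffs(ham_scores, spam_scores,
                     fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
        # Brute-force the cutoff pair minimizing the 3-term linear cost
        #   fp_weight*FP + fn_weight*FN + unsure_weight*Unsure
        grid = [i / 100.0 for i in range(101)]
        best = None
        for lo in grid:
            for hi in grid:
                if hi < lo:
                    continue
                fp = sum(classify(s, lo, hi) == 'spam'
                         for s in ham_scores)
                fn = sum(classify(s, lo, hi) == 'ham'
                         for s in spam_scores)
                unsure = sum(classify(s, lo, hi) == 'unsure'
                             for s in ham_scores + spam_scores)
                cost = (fp_weight * fp + fn_weight * fn
                        + unsure_weight * unsure)
                if best is None or cost < best[0]:
                    best = (cost, lo, hi)
        return best  # (cost, ham_cutoff, spam_cutoff)

Weighting a false positive ten times as heavily as a false negative
reflects the usual asymmetry: losing real mail hurts far more than
letting one spam through.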

> The *important* part of all that, *especially* if you've only got one
> test corpus, is to slice and dice it to get multiple independent runs out of
> it.  If some change helps in 7 of 10 runs and doesn't matter in the other 3,
> it's probably a winner.  If it helps in 3, hurts in 3, and doesn't matter in
> 4, it's probably a waste of time.  The project's TESTING.txt says more about
> that.

Amen.  I've used that approach in both of the big tests I've done
recently (http://www.bgl.nu/~glouis/bogofilter links to them).
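
The bookkeeping behind that rule of thumb is trivial once you have
paired per-run error counts; a toy sketch (the real numbers would
come from rates.py and cmp.py):

    def compare_runs(before, after):
        # Paired per-run error counts (e.g. FP+FN per run); lower is
        # better.  Assumes one entry per run in each list.
        wins = sum(1 for b, a in zip(before, after) if a < b)
        losses = sum(1 for b, a in zip(before, after) if a > b)
        return wins, len(before) - wins - losses, losses

    # compare_runs([5, 4, 6, 5, 7, 4, 5, 6, 5, 4],
    #              [3, 4, 5, 4, 6, 4, 4, 5, 5, 3]) -> (7, 3, 0): winner
    # compare_runs([5, 4, 6, 5], [6, 4, 5, 5])     -> (1, 2, 1): noise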

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
