glouis at dynamicro.on.ca
Tue Nov 19 07:55:48 EST 2002
On 20021118 (Mon) at 2344:07 -0500, Tim Peters wrote:
> [Greg Louis]
> > ...
> > Given that we're applying the calculations to the same individual
> > probability values, I find the congruency of the two calculation
> > methods somewhat reassuring. The next step is to try to understand
> > whether the spambayes folks get better _binary_ results with
> > chi-squared, or whether the better results are attributable to
> chi-squared facilitating the identification of problem cases.
> I expect binary results to be the same, usually, provided you're omniscient
> enough to pick the geomean spam cutoff value to 3 decimal digits of
> precision in advance(!).
> At least one of our testers gets much better results with chi even
> with perfect geomean hindsight, though, and that isn't understood.
I saw a report yesterday from one tester who claims much better results
with original Graham, and that isn't understood either :)
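For anyone following along, the two combining schemes being compared can be
sketched in a few lines of Python. The chi-squared form follows the published
spambayes approach (Fisher's method applied to the token probabilities and
their complements); the function names and the standalone chi2q helper here
are mine for illustration, not either project's actual API:

```python
import math

def chi2q(x2, df):
    # Survival function P(chi2 >= x2) for even degrees of freedom,
    # computed from the series expansion (df must be even).
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def geomean_score(probs):
    # Graham-style geometric mean of the individual token probabilities.
    n = len(probs)
    return math.exp(sum(math.log(p) for p in probs) / n)

def chi_score(probs):
    # Chi-squared combining in the spambayes style: test the p_i (and the
    # 1 - p_i) against the hypothesis that they are uniform random, then
    # fold the two one-sided indicators into a single score in [0, 1].
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0
```

On strongly spammy inputs (all p_i near 1) chi_score approaches 1, on hammy
inputs it approaches 0, and when the evidence cancels it sits near 0.5 -- which
is what makes the "unsure" middle ground easy to spot with this scheme.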
> Our testing framework keeps track of FP, FN, unsures, and various drivers
> exercise it. The driver most people use does 10-fold cross-validation.
> That means the total ham and spam gets split up at random into 10 sets of
> pairs, then 10 runs are done, each training on 9 of the pairs and predicting
> against the remaining pair. The output from that is voluminous. rates.py
> produces summary statistics from the output, cmp.py compares two summaries
> side-by-side (usually "before" and "after" some change), and table.py
> produces *really* brief summaries from any number of output files. The
> kinds of stats produced are raw counts, error rates, ham and spam score min,
> max, median, mean and standard deviation, percentile points beyond just the
> median, a measure of ham-vs-spam separation based on their respective score
> means and sdevs, score histograms, and automated analysis deducing the best
> possible cutoff values based on minimizing a 3-term linear cost function.
> Etc <wink>.
Nice! Thanks for the detail!
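The 10-fold split Tim describes is easy to mimic: shuffle each corpus once,
deal it into 10 buckets, pair ham bucket i with spam bucket i, then train on
9 pairs and predict against the remaining pair. A minimal sketch (names are
illustrative, not the spambayes driver's actual interface):

```python
import random

def tenfold_pairs(ham, spam, seed=42):
    # Deal the shuffled ham and spam corpora into 10 paired buckets,
    # then yield (training_pair, test_pair) for each of the 10 runs.
    rng = random.Random(seed)
    ham, spam = list(ham), list(spam)
    rng.shuffle(ham)
    rng.shuffle(spam)
    buckets = [(ham[i::10], spam[i::10]) for i in range(10)]
    for i in range(10):
        test_pair = buckets[i]
        train_ham = [m for j, (h, _) in enumerate(buckets) if j != i for m in h]
        train_spam = [m for j, (_, s) in enumerate(buckets) if j != i for m in s]
        yield (train_ham, train_spam), test_pair
```

Every message gets predicted exactly once (when its bucket is held out) and
trained on in the other nine runs, which is what makes the 10 runs usable as
quasi-independent trials.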
> The *important* part of all that is, and *especially* if you've only got one
> test corpus, to slice and dice it to get multiple independent runs out of
> it. If some change helps in 7 of 10 runs and doesn't matter in the other 3,
> it's probably a winner. If it helps in 3, hurts in 3, and doesn't matter in
> 4, it's probably a waste of time. The project's TESTING.txt says more about
Amen. I've used that approach in both of the big tests
(http://www.bgl.nu/~glouis/bogofilter links to them) that I've done.
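Tim's "best possible cutoff" analysis -- minimizing a 3-term linear cost over
FP, FN, and unsures -- can be sketched as a brute-force grid search. The cost
weights and the function name below are illustrative; the real drivers let you
set the weights:

```python
def best_cutoffs(ham_scores, spam_scores,
                 fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    # Search a 0.01-step grid of (ham_cutoff, spam_cutoff) pairs for the
    # one minimizing cost = fp_cost*FP + fn_cost*FN + unsure_cost*Unsures.
    # A score above the spam cutoff is called spam, below the ham cutoff
    # is called ham, and anything in between is Unsure.
    grid = [i / 100.0 for i in range(101)]
    best = None
    for lo in grid:
        for hi in grid:
            if hi < lo:
                continue
            fp = sum(1 for s in ham_scores if s > hi)
            fn = sum(1 for s in spam_scores if s < lo)
            unsure = (sum(1 for s in ham_scores if lo <= s <= hi) +
                      sum(1 for s in spam_scores if lo <= s <= hi))
            cost = fp_cost * fp + fn_cost * fn + unsure_cost * unsure
            if best is None or cost < best[0]:
                best = (cost, lo, hi)
    return best
```

Weighting a false positive at 10x a false negative, with unsures cheap but not
free, is one plausible weighting; the point is that the optimum pair of cutoffs
falls out of the score distributions automatically.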
| Greg Louis                | gpg public key:       |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |