Robinson-Fisher (aka chi) beats Bayes chain rule

Greg Louis glouis at dynamicro.on.ca
Sat Nov 30 14:22:22 CET 2002


On 20021129 (Fri) at 22:00:08 -0500, Bill Yerazunis wrote:

> 
> OUCH!  You _trust_ that code???  :-)  (and, um, thanks.)

Well, I did read it with some care ;-)  It does look as though it does
what it says it does, so if that's right, we're OK <grin>.

> A few things to point out, some of which you've already noticed:
> 
> 1) You're right, it's extremely dogmatic as the number of features
> goes up.  You need only a few hundred features to bang into the
> 10^-308 or so hard-limit of IEEE floating point.

Yup.  My problem is, of course, that a double between 1 and 1 - 1.0e-16
is indistinguishable from 1; that's why I get at least 71 1.0's at the
top of the nonspam list.
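
For anyone following along, here's that arithmetic in miniature (rough
Python, not bogofilter or CRM114 code, just the IEEE-double behaviour):

    import math

    # Hypothetical per-feature probabilities, all fairly extreme.
    p = [0.001] * 500

    prod = 1.0
    for x in p:
        prod *= x
    print(prod)                          # 0.0 -- underflows past ~1e-308

    print(sum(math.log(x) for x in p))   # about -3453.9, no trouble at all

    # The complementary problem near 1.0: one ulp of a double near 1.0
    # is about 2.2e-16, so any score closer to 1 than that collapses
    # to exactly 1.0.
    print(1.0 - 1e-17 == 1.0)            # True

Working in log space (or rescaling as you go) sidesteps the underflow;
the pile-up at exactly 1.0 is the harder one to dodge.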

> 2) Part of this dogmaticity is the nature of the way p(w) is calculated.
> The current method is _totally_ ad-hoc; a single feature found in
> ONE element of the training set and never found again contributes 
> a factor of .6666 (or .3333) into the chain rule.  This is far too 
> much; I've toyed with other functions (including a table-driven function)
> for this, but I would like better mathematical justification
> than the current guesswork.
> 
> 3) another part of the dogmaticity is (imho) due to the violation of
> independence on the part of the feature set.  The Bayesian chain rule
> theorem is valid only under the assumption of independence of events;
> in fact our events are _highly_ correlated.  
> 
> For example, I have 619915 hashed features in my current "under test"
> CRM114 discriminator hash files.  Of those:
> 
> 	      76473 are the same
> 	     543442 are different
> 
> A random distribution of features would show about 348K each.  This
> shows the features are highly correlated within the same corpus and
> anti-correlated across corpora.
> 
> I don't know if there is any way to salvage this mathematically.

Gary Robinson has been trying to find a way but so far (as you've found
:) it ain't easy.
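
To make the assumption explicit, this is the shape of the chain-rule
combining we're talking about (a rough Python sketch in the usual
Graham-style form; CRM114's actual code differs in detail):

    import math

    def bayes_chain_rule(probs):
        # probs: per-feature p(w) values.  Multiplying them together
        # treats every feature as independent evidence -- exactly the
        # assumption that the correlation counts above violate.
        spam = math.prod(probs)
        ham = math.prod(1.0 - p for p in probs)
        return spam / (spam + ham)

Every correlated feature gets counted as if it were fresh evidence, so
the result is pushed toward 0 or 1 much harder than it should be.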

> I assume that each word found in each corpus was taken without hashing
> and used as a single "feature" for your training, so the total number
> of features is equal to the number of words in the corpora (and there
> are zero collisions due to hash-clashes)?

Approximately.  The lexical analysis is somewhat smarter than mere
words, but that's the basic idea.

> And, that you trained on 100% of the available training text,
> rather than training only on errors?  (I train only on errors,
> I find that it works better.)

I'm finding the same, but for this experiment, yes, I trained on the
whole training corpora (spam and nonspam).  I'm in favour of your
proposed variation (v.i.) and will run it when I get the chance.

> So, not a big, big difference.  Suspicious, but not significant
> in and of itself. 
> 
> Which confuses the heck out of me to see:
> 
>    We can calculate the means and 95% confidence limits of these values,
>    and they come out like this:
> 
>        calc meanfnpc lcl95 ucl95
>    1    chi     1.51  1.45  1.56
>    2 bcr-fw     1.79  1.73  1.85

The corrected values are

    calc meanfnpc lcl95 ucl95
1    chi     1.51  1.38  1.63
2 bcr-fw     1.79  1.66  1.92

and the difference in means is _just_ "significant" at the 0.05 level.
This is with the run effect factored out, which is what's different from
your test, which lumps the run effect in with the residuals.  See the
writeup later this weekend; I'll go into this a bit more.  Think
"pairwise" in the meantime (there's a rough illustration of the
paired-versus-unpaired idea after the tables below).  Also, did you see
what happened when I bumped the cutoff down to 71 from 84?  (This might
be the train-on-everything effect; we shall see.)

    calc run fneg  percent
1    chi   0   30 2.118644
2    chi   1   25 1.765537
3    chi   2   27 1.906780
4 bcr-fw   0   91 6.426554
5 bcr-fw   1   84 5.932203
6 bcr-fw   2   96 6.779661

    calc meanfnpc lcl95 ucl95
1    chi     1.93  1.25  2.61
2 bcr-fw     6.38  5.70  7.06
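
If it helps, here's the pairwise idea in miniature, using the cutoff-71
percentages above (rough Python with scipy -- not what I'm actually
running, just to show why factoring out the run effect tightens the
comparison):

    from scipy import stats

    # Per-run false-negative percentages from the table above.
    chi    = [2.118644, 1.765537, 1.906780]
    bcr_fw = [6.426554, 5.932203, 6.779661]

    # Unpaired: run-to-run variation gets lumped into the residuals.
    print(stats.ttest_ind(bcr_fw, chi))

    # Paired: each run is compared with itself, so the run effect
    # cancels and the same difference in means looks much sharper.
    print(stats.ttest_rel(bcr_fw, chi))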

> Your experimental observations are quite right -- I have only seen
> _one_ message out of 5000+ in the last month that came through
> BCR with a probability within the band .0001 ... .9999.
> 
> One further experiment I would suggest that you run is to make the
> runs again, but training each of the three classifiers ONLY on errors
> (that is, you shuffle the training corpus as to spam and nonspam, and
> submit the messages one at a time, in random order, to the untrained
> classifier.)

This I shall do.  If I feel enthusiastic enough after writing the
month-end and quarter-end reports for my employer, I may even get that
_and_ the writeup done this weekend :)

> It will be interesting to see if the results of that "errors-only"
> training improve BCR, and if they also improve Robinson Geometric
> Mean, and Robinson-Fisher.

Well, I'll do BCR (with f(w)) and Robinson-Fisher; I'm not tooled up to
do the geometric mean at the moment.
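
For reference, the Robinson-Fisher ("chi") combining I'm running is, as
I understand it, essentially this (rough Python sketch of Robinson's
inverse-chi-square formula; the real implementation accumulates the log
sums incrementally to avoid the underflow discussed above):

    import math
    from scipy.stats import chi2

    def robinson_fisher(fw):
        # fw: per-token f(w) values in (0, 1).
        n = len(fw)
        # Fisher's method: -2 * sum(ln x) is chi-squared with 2n d.f.
        # when the x's are independent and uniform on (0, 1).
        spam_p = chi2.sf(-2.0 * sum(math.log(1.0 - f) for f in fw), 2 * n)
        ham_p  = chi2.sf(-2.0 * sum(math.log(f) for f in fw), 2 * n)
        # spam_p is tiny when many tokens look spammy; ham_p is tiny
        # when many look hammy; combine them into one indicator.
        return (1.0 + ham_p - spam_p) / 2.0   # ~1 spam, ~0 nonspam, ~0.5 unsure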

> I want to see if I can come up with a way to salvage the assumption
> of independence, or, failing that, renormalize event counts so that
> the significance of low-count events is more properly assessed.

Gary's f(w) technique addresses the latter to some extent, if I've
understood you rightly.
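
Roughly, if I have the formula right, f(w) shrinks the naive per-token
estimate toward a neutral prior so that rare tokens can't swing the
score all by themselves; something like this, in sketch form (s and x
are the tuning knobs from Gary's writeup):

    def f_w(bad_count, good_count, nbad, ngood, s=1.0, x=0.5):
        # bad_count/good_count: spam/nonspam messages containing the token;
        # nbad/ngood: corpus sizes; s: strength of the prior; x: the
        # assumed probability for a token we know nothing about.
        n = bad_count + good_count
        if n == 0:
            return x
        # Naive estimate, with counts scaled so unequal corpus sizes
        # don't bias it.
        badness = bad_count / nbad
        goodness = good_count / ngood
        p = badness / (badness + goodness)
        # Shrink p toward x; a token seen only once or twice stays close
        # to x instead of swinging to an extreme.
        return (s * x + n * p) / (s + n)

With s = 1 and x = 0.5, a token seen once, in spam only, comes out at
0.75 instead of 1.0.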
 
> (and thanks!  Good to see people doing real science)

Kinder words could hardly be found :)

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |


