Robinson-Fisher (aka chi) beats Bayes chain rule

Scott Lenser slenser at cs.cmu.edu
Wed Dec 4 22:16:27 CET 2002


> 
>    Date: Wed, 04 Dec 2002 15:26:37 -0500
>    From: Scott Lenser <slenser at cs.cmu.edu>
> 
>    > > A few things to point out, some of which you've already noticed:
>    > > 
>    > > 1) You're right, it's extremely dogmatic as the number of features
>    > > goes up.  You need only a few hundred features to bang into the
>    > > 10^-308 or so hard-limit of IEEE floating point.
>    > 
>    > Yup.  My problem is, of course, that a double between 1 and 1 - 1.0e-16
>    > is indistinguishable from 1; that's why I get at least 71 1.0's at the
>    > top of the nonspam list.
>    > 
> 
>    A common technique used by AI researchers to get around this is to use the
>    log of the probability.  So instead of multiplying together a bunch of
>    probabilities you instead add the log of the probabilities.  Makes it much
>    more stable numerically and since log is monotonic you can directly compare
>    the log probabilities to decide which class is more likely.
> 
> Logs actually don't work well for this on the side of probabilities close
> to 0, because numbers like 0.00000000001 have logs that are like -25 or 
> so, while log (.9999999999999) is a very small negative number.  You 
> quickly underflow significance of "very likely" events with only a few
> hundred unlikely events.
> 
> I don't have a good answer to this, but Dale Worley has suggested use of
> the tanh function as one with the right properties.  I haven't implemented
> it so I can't tell you if it works well or not.
>    
> My "it seems to work" answer is to keep both P(good) and P(evil) as
> separate probabilities, run one step of the chain rule (or the
> equivalent) on both, then recalculate the larger from the smaller as
> P(bigger) = 1 - P(smaller).  The one with the lower value will control
> the higher, and so you don't run out of significance till you hit IEEE
> floating point's limit at 10^-300+ or so.
> 
> You can steal the code for this at crm114.sourceforge.net.
> 
>    -Bill Y.
> 
> 

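As a rough illustration of the renormalization trick Bill describes above,
one chain-rule step might look like this (a Python sketch, not the actual
CRM114 code; the per-word likelihoods are hypothetical inputs):

def chain_rule_step(p_spam, p_good, p_word_spam, p_word_good):
    # One Bayes chain-rule update for both classes.
    num_spam = p_spam * p_word_spam
    num_good = p_good * p_word_good
    norm = num_spam + num_good
    p_spam, p_good = num_spam / norm, num_good / norm
    # Recompute the larger value from the smaller, so the information
    # is carried by the small value, which stays representable down to
    # IEEE's ~1e-308 instead of rounding to exactly 1.0.
    if p_spam < p_good:
        p_good = 1.0 - p_spam
    else:
        p_spam = 1.0 - p_good
    return p_spam, p_good
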
Let me explain what I meant but didn't really say.

Usually this is done by keeping P(good) and P(evil) separately.  These
probabilities usually aren't kept normalized, since they are only used for
a comparison at the end.  So for the Bayes chain rule (BCR) method it would
look something like this:

instead of

P(S|A_1) = P(A_1|S) * P(S)
           -------------------------------------
           P(A_1|S) * P(S) + P(A_1|NS) * P(NS)

it would look like this:

let alpha = P(A_1|S) * P(S) + P(A_1|NS) * P(NS)

P(S|A_1)  = P(A_1|S)  * P(S)  / alpha
P(NS|A_1) = P(A_1|NS) * P(NS) / alpha

Since alpha is the same positive factor in both equations, it doesn't affect
the comparison P(S|A_1) > P(NS|A_1), so it can simply be dropped.

Taking the log of the (now unnormalized) values gives:

log P(S|A_1)  = log(P(A_1|S))  + log(P(S))
log P(NS|A_1) = log(P(A_1|NS)) + log(P(NS))

so for a full message, this becomes:

log P(S|A_1, A_2, A_3, A_4)  = log(P(A_1|S))  + log(P(A_2|S))  + log(P(A_3|S))  + log(P(A_4|S))  + log(P(S))
log P(NS|A_1, A_2, A_3, A_4) = log(P(A_1|NS)) + log(P(A_2|NS)) + log(P(A_3|NS)) + log(P(A_4|NS)) + log(P(NS))

and then the comparison becomes P(S|msg) > P(NS|msg), or equivalently log(P(S|msg)) > log(P(NS|msg)).
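
In code, this unnormalized log-domain scoring might look roughly like the
following (a sketch only; the per-token likelihood tables p_tok_spam and
p_tok_good are assumed here, not bogofilter's actual data structures):

import math

def log_scores(tokens, p_tok_spam, p_tok_good, p_s=0.5, p_ns=0.5):
    # Unnormalized log P(S|msg) and log P(NS|msg): the sum of the
    # per-token log likelihoods plus the log prior for each class.
    log_s = math.log(p_s)
    log_ns = math.log(p_ns)
    for tok in tokens:
        log_s += math.log(p_tok_spam[tok])
        log_ns += math.log(p_tok_good[tok])
    return log_s, log_ns

A message would then be called spam when log_s > log_ns.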

To test whether P(S|msg) > beta (0.999, for example), you can use

log(P(S|msg)) - log(P(NS|msg)) > log(beta / (1-beta))

The left side can be computed directly from the unnormalized log scores
above, since the common normalizing terms cancel in the difference.

this works because

log(P(S|msg)) - log(P(NS|msg)) > log(beta / (1-beta))
log(P(S|msg) / P(NS|msg)) > log(beta / (1-beta))
P(S|msg) / P(NS|msg) > beta / (1-beta)
P(S|msg) / P(NS|msg) + 1 > beta / (1-beta) + 1
(P(S|msg) + P(NS|msg)) / P(NS|msg) > 1 / (1-beta)
1 / P(NS|msg) > 1 / (1-beta)          (since P(S|msg) + P(NS|msg) = 1)
P(NS|msg) < 1-beta
1-P(NS|msg) > 1-(1-beta)
P(S|msg) > beta
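
With beta = 0.999 the right-hand side is log(999), about 6.9 in natural
log, so the threshold test is just a margin check on the two log scores
(a minimal sketch, reusing the log_scores values from the sketch above):

import math

def exceeds_threshold(log_s, log_ns, beta=0.999):
    # P(S|msg) > beta  <=>  log P(S|msg) - log P(NS|msg) > log(beta/(1-beta))
    return log_s - log_ns > math.log(beta / (1.0 - beta))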

- Scott
