Robinson-Fisher (aka chi) beats Bayes chain rule
Scott Lenser
slenser at cs.cmu.edu
Wed Dec 4 22:16:27 CET 2002
>
> Date: Wed, 04 Dec 2002 15:26:37 -0500
> From: Scott Lenser <slenser at cs.cmu.edu>
>
> > > A few things to point out, some of which you've already noticed:
> > >
> > > 1) You're right, it's extremely dogmatic as the number of features
> > > goes up. You need only a few hundred features to bang into the
> > > 10^-308 or so hard-limit of IEEE floating point.
> >
> > Yup. My problem is, of course, that a double between 1 and 1 - 1.0e-16
> > is indistinguishable from 1; that's why I get at least 71 1.0's at the
> > top of the nonspam list.
> >
>
> A common technique used by AI researchers to get around this is to use the
> log of the probability. So instead of multiplying together a bunch of
> probabilities you instead add the log of the probabilities. Makes it much
> more stable numerically and since log is monotonic you can directly compare
> the log probabilities to decide which class is more likely.
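A quick sketch of this log trick (the per-feature probabilities are made up for illustration):

```python
import math

# A few hundred small probabilities; hypothetical values for illustration.
probs = [1e-5] * 400

# Multiplying directly underflows IEEE doubles (10^-2000 << 10^-308)...
product = 1.0
for p in probs:
    product *= p

# ...but the sum of the logs is perfectly representable.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0 (underflow)
print(log_sum)  # roughly -4605.2
```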
>
> Logs actually don't work well for this on the side of probabilities close
> to 0, because numbers like 0.00000000001 have logs that are like -25 or
> so, while log (.9999999999999) is a very small negative number. You
> quickly underflow significance of "very likely" events with only a few
> hundred unlikely events.
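Numerically, the asymmetry looks like this (the probabilities are made up):

```python
import math

# log of a probability very close to 1 is a tiny negative number...
log_likely = math.log(0.9999999999999)   # about -1.0e-13
# ...while a probability close to 0 has a large-magnitude log.
log_unlikely = math.log(1e-11)           # about -25.3

# Three hundred "very likely" features contribute almost nothing
# next to just three "unlikely" ones, so the likely evidence is swamped.
print(300 * log_likely)
print(3 * log_unlikely)
```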
>
> I don't have a good answer to this, but Dale Worley has suggested use of
> the tanh function as one with the right properties. I haven't implemented
> it so I can't tell you if it works well or not.
>
> My "it seems to work" answer is to keep both P(good) and P(evil) as
> separate probabilities, run one step of the chain rule (or the
> equivalent) on both, then recalculate the larger from the smaller as
> P(bigger) = 1 - P(smaller). The one with the lower value will control
> the higher, and so you don't run out of significance till you hit IEEE
> floating point's limit at 10^-300+ or so.
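A sketch of what this might look like in code -- this is my reading of the description above, not the actual crm114 code, and the token probabilities are invented:

```python
def classify(p_tokens_good, p_tokens_evil):
    """One chain-rule step per token, renormalizing each time and
    recomputing the larger probability from the smaller one."""
    p_good, p_evil = 0.5, 0.5   # uninformative priors (assumed)
    for pg, pe in zip(p_tokens_good, p_tokens_evil):
        # one step of the chain rule on both probabilities
        num_good = p_good * pg
        num_evil = p_evil * pe
        total = num_good + num_evil
        p_good, p_evil = num_good / total, num_evil / total
        # recalculate the larger from the smaller: the smaller value
        # keeps its full significance down to ~1e-308
        if p_good < p_evil:
            p_evil = 1.0 - p_good
        else:
            p_good = 1.0 - p_evil
    return p_good, p_evil

# 100 strongly "evil" tokens: p_good stays a meaningful tiny number
# instead of collapsing against the granularity of values near 1.0.
p_good, p_evil = classify([0.01] * 100, [0.99] * 100)
print(p_good, p_evil)
```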
>
> You can steal the code for this at crm114.sourceforge.net.
>
> -Bill Y.
>
>
Let me explain what I meant but didn't really say.
Usually this is done by keeping P(good) and P(evil) separately. These
probabilities usually aren't kept normalized, since they are only used for
a comparison at the end. So for the Bayes chain rule method it would look
something like this:
instead of

                         P(A_1|S) * P(S)
P(S|A_1) = -----------------------------------------
            P(A_1|S) * P(S) + P(A_1|NS) * P(NS)
it would look like this:
let alpha = P(A_1|S) * P(S) + P(A_1|NS) * P(NS)
P(S|A_1) = P(A_1|S) * P(S) / alpha
P(NS|A_1) = P(A_1|NS) * P(NS) / alpha
The alpha doesn't affect the comparison of P(S|A_1) > P(NS|A_1), so it can
be dropped without affecting the result. Taking logs (with alpha dropped):
log P(S|A_1) = log(P(A_1|S)) + log(P(S))
log P(NS|A_1) = log(P(A_1|NS)) + log(P(NS))
so for a full message, this becomes:
log P(S|A_1, A_2, A_3, A_4) = log(P(A_1|S)) + log(P(A_2|S)) + log(P(A_3|S)) + log(P(A_4|S)) + log(P(S))
log P(NS|A_1, A_2, A_3, A_4) = log(P(A_1|NS)) + log(P(A_2|NS)) + log(P(A_3|NS)) + log(P(A_4|NS)) + log(P(NS))
and then the comparison becomes P(S|msg) > P(NS|msg), or equivalently log(P(S|msg)) > log(P(NS|msg)).
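Putting the unnormalized log-domain version together (the likelihoods and priors below are made-up values):

```python
import math

# Hypothetical P(A_i|S) and P(A_i|NS) for four features, plus priors.
p_feat_spam = [0.9, 0.8, 0.95, 0.7]
p_feat_ham  = [0.1, 0.3, 0.05, 0.4]
prior_spam = prior_ham = 0.5

# alpha is dropped, so only the comparison of the two sums matters.
log_p_spam = math.log(prior_spam) + sum(math.log(p) for p in p_feat_spam)
log_p_ham  = math.log(prior_ham)  + sum(math.log(p) for p in p_feat_ham)

print("spam" if log_p_spam > log_p_ham else "nonspam")  # spam
```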
To compare whether P(S|msg) > beta (.999 for example), you can use
log(P(S|msg)) - log(P(NS|msg)) > log(beta / (1-beta))
this works because
log(P(S|msg)) - log(P(NS|msg)) > log(beta / (1-beta))
log(P(S|msg) / P(NS|msg)) > log(beta / (1-beta))
P(S|msg) / P(NS|msg) > beta / (1-beta)
P(S|msg) / P(NS|msg) + 1 > beta / (1-beta) + 1
(P(S|msg) + P(NS|msg)) / P(NS|msg) > 1 / (1-beta)
1 / P(NS|msg) > 1 / (1-beta)          [since P(S|msg) + P(NS|msg) = 1]
P(NS|msg) < 1-beta
1-P(NS|msg) > 1-(1-beta)
P(S|msg) > beta
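The threshold test can be checked numerically (beta and the posteriors are made-up, normalized values):

```python
import math

beta = 0.999
threshold = math.log(beta / (1.0 - beta))

def exceeds_beta(p_spam):
    # assumes P(S|msg) + P(NS|msg) = 1
    p_ham = 1.0 - p_spam
    return math.log(p_spam) - math.log(p_ham) > threshold

print(exceeds_beta(0.9995))  # True  (0.9995 > beta)
print(exceeds_beta(0.99))    # False (0.99 < beta)
```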
- Scott
More information about the Bogofilter mailing list