comments from a new user

Andrew Pimlott andrew at pimlott.net
Sat May 10 14:54:29 CEST 2003


On Sat, May 10, 2003 at 07:57:50AM -0400, Greg Louis wrote:
> On 20030509 (Fri) at 2313:08 -0400, Andrew Pimlott wrote:
> > According to the FAQ, pgood and pbad must trivially add up to 1,
> > which is clearly not true.  :-)
> 
> You are incorrect here.  "The likelihood that a message containing this
> token is spam" can be, say, 0.1, and the likelihood that a message
> containing this same token is nonspam can be 0.05.  There is nothing in
> the description (which I didn't like when I first saw it, but which in
> fact is more accurate than any other we've come up with) that implies
> that the two should add to 1.  You mistakenly suppose that pgood and
> pbad are probabilities; the choice of the word "likelihood" is
> deliberate, and is meant to imply that they represent only very rough
> estimates of probability based on limited information.

Even if likelihoods are only estimates of probability, as long as you
take the same evidence into account in computing each (it would be
silly not to), the likelihoods of P and ~P must add to 1.

> >     "the likelihood (extrapolated from your registered non-spam
> >     messages) that a non-spam message contains this token"
> 
> That would be extremely inaccurate.  pgood tells us nothing about the
> likelihood that a nonspam message contains the token;

Huh?  This is one of the premises on which bogofilter's Bayesianish
analysis is based.

> it addresses the
> likelihood that a message containing the token is nonspam.

This is exactly backwards.  In the following line from -vvv,

                                         n    pgood     pbad      fw     U
    "andrew"                           785  0.685078  0.226744  0.248672 +

fw is (at least roughly) the likelihood that a message containing
"andrew" is spam.  Not pgood or pbad.
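
As a quick check on the numbers in that line, here is a minimal Python
sketch.  It assumes (this is my guess, not something taken from the
bogofilter source) that fw is close to pbad normalized against pgood,
with spam and nonspam weighted equally; the variable names are made up
for illustration.

```python
# Values copied from the -vvv line for the token "andrew".
pgood = 0.685078
pbad = 0.226744
fw_reported = 0.248672

# pgood and pbad are each conditioned on a different class of message,
# so nothing forces them to sum to 1.
total = pgood + pbad

# Assumption: normalizing pbad against pgood estimates the likelihood
# that a message containing the token is spam (equal class weights).
p_spam_given_token = pbad / (pgood + pbad)

# This estimate and its complement do sum to 1, as P and ~P must.
p_nonspam_given_token = 1.0 - p_spam_given_token

print(f"pgood + pbad        = {total:.6f}")
print(f"pbad/(pgood + pbad) = {p_spam_given_token:.6f}")
print(f"fw (reported)       = {fw_reported:.6f}")
```

With these inputs the normalized value agrees with the reported fw to
roughly five decimal places, while pgood + pbad comes out well short
of 1.  The inputs are rounded, and any smoothing bogofilter applies for
rarely seen tokens is ignored here, so treat the match as approximate.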

Anyhow, the new wording is perfect.

Andrew



