better Bayesian bogofilter

Mon Aug 11 18:21:18 CEST 2003

Hi, everyone:

Here's a paper I just finished writing.  We're not doing the p(w)
calculation exactly right in bogofilter, and if we fix it, accuracy
might improve a bit.  There's always a gotcha, though... I'll let the
paper speak for itself:

Really Bayesian Bogofilter -- altering the calculation slightly
Greg Louis, Aug. 2003

Introduction

The way Paul Graham calculates token probability in his article A Plan
for Spam (http://www.paulgraham.com/spam.html) is a simplification of
Bayes' theorem.  I have been wondering if using the simplification is
actually better than just using Bayes' theoremm directly.  What Paul
wrote is:

(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
 (unless (< (+ g b) 5)
  (max .01
       (min .99 (float (/ (min 1 (/ b nbad))
                          (+ (min 1 (/ g ngood))
                             (min 1 (/ b nbad)))))))))

Which in ordinary math, and shorn of limits and omitting doubling
of g, reads

Let g be the number of times a token is found in nonspam
    b be the number of times the token is found in spam
    G be the number of nonspam messages used in building the database
    B be the number of spam messages used in building the database
    p be the likelihood that a mail containing the token is spam

Then

                  b/B
[#1]       p = ---------
               b/B + g/G

We've all been using that, and it works well.  Paul wrote:

"The especially observant will notice that while I consider each corpus
 to be a single long stream of text for purposes of counting
 occurrences, I use the number of emails in each, rather than their
 combined length, as the divisor in calculating spam probabilities. This
 adds another slight bias to protect against false positives."

As I hope to demonstrate below by deriving the equation for p from
first principles, the concept is right but the application may not be.
I will also point out a possible difficulty in training on error, and
suggest one way to work around it.

Derivation

Bayes' rule relates the probability of a binary condition C, given a
positive result T of a test for C, to the power of the test (the
probability of T given C) and the likelihood of C.  If P[x|y] is read
as "the probability of x given y", and C' is the complement of C,
Bayes' rule is

                             P[T|C] * P[C]
[#2]       P[C|T] = ---------------------------------
                    P[T|C] * P[C]  +  P[T|C'] * P[C']

Let's express each of these terms in spam-filtering language.  C is the
condition "the message is spam".  T is "the token is present".

P[C|T] is the likelihood that a message containing the token is spam.

           P[C|T] == p

P[T|C] is the likelihood that a spam will contain the token.  Our
estimate of that is the proportion of spam messages in which the token
was found.

           P[T|C] = b / B

P[C] is the likelihood that a random message is spam.  Our estimate of
that is the proportion of spam messages used to build the training
database.

           P[C] = B / (B + G)

(If we have done full training, this may be a good estimate; but what
if we train on errors and unsures?  The balance may not be perfect, and
to the extent it isn't, our P[C] and P[C'] values will be askew.  It
might be preferable, in such a case, to have bogofilter keep a running
count of the proportion of spam in the messages it has processed, and
use that instead; it will be less than perfectly accurate because of
bogofilter's classification errors, but for this purpose it would
likely be good enough.)

P[T|C'] is the likelihood that a nonspam will contain the token.  Our
estimate of that is the proportion of nonspam messages in which the
token was found.

           P[T|C'] = g / G

P[C'] is the likelihood that a random message is not spam.

           P[C'] = G / (B + G)

Given these definitions, we can see that Paul's equation above,
expressed in the notation I used to present Bayes' rule, is

                        P[T|C]
[#3]       P[C|T] = ----------------
                    P[T|C] + P[T|C']

This simplification is correct if and only if the numbers of spam and
nonspam messages are equal (P[C] == P[C'] == 0.5).

If the training database and the population of messages as a whole have
the same ratio of spam to nonspam, the above definitions apply. 
Substituting them into equation [#2] (Bayes' rule) yields a very simple
result:

                     b    B              b
                     - * ---            ---
                     B   B+G            B+G       b 
[#4]       p = -------------------  ==  ---  ==  ---
               b    B      g    G       b+g      b+g
               - * ---  +  - * ---      --- 
               B   B+G     G   B+G      B+G

Hypothesis

So in fact, for bogofilter to work optimally, it would appear that:

(1) Whatever training method we employ, we should try to keep the
ratio of spam to nonspam the same as it is in the messages we are
receiving.  If we can do that, then we can use p = b / (b+g) to
calculate token probabilities.

(2) If we can't do that, then we need to count spam and nonspam in our
incoming messages in some other way (call these counts B' and G'), and
calculate

                       b     B'
                       - * -----
                       B   B'+G'
[#5]       p = -----------------------
               b     B'      g     G'
               - * -----  +  - * -----
               B   B'+G'     G   B'+G'
instead.

Experimental verification

A corpus of messages collected during April and May of this year
contained 21,345 (56.64%) spam and 16,340 (43.36%) nonspam.  These
messages were classified with bogofilter 0.14.3.  The training database
was generated from another corpus with a similar spam/nonspam ratio,
but fewer spam were used in training, making the database (message
counts 15,240 spam and 21,040 nonspam) deliberately unrepresentative of
the population spam/nonspam ratio.  I did that because I wanted to show
the effect such a distortion would have.

When classified with normal bogofilter (equation #1), the test messages
yielded 136 (0.83%) false positives.  The messages were then classified
with a modified bogofilter using equation [#5], with B' 0.5664 and G'
0.4336.  The spam cutoff was adjusted to give exactly the same number
of false positives.  For comparison, the same was done with another
modified bogofilter using equation [#4].  This would be expected to
perform worst of the three, because of the large discrepancy between
the spam/nonspam ratios of the database and in the message population.

        Equation used    False negatives
            [#1]           311 (1.46%)
            [#5]           303 (1.41%)
            [#4]           377 (1.76%)

These results are as expected.  Taking the population distribution into
account (equation #5) gave the most accurate discrimination, though
ignoring it (equation #1) only worsened discrimination by 3.5%, which
is not extreme.  Using equation #4, clearly unjustified given the
skewed spam/ham ratio, had a more severe effect (24.8%) on the number
of false negatives.  These differences, of course, reflect the degree
to which the spam/nonspam ratio used in calculation differed from that
present in the source population.  The effect, though small, seems to
be of practical significance.

Conclusion

The suggestions presented above under "Hypothesis" should be adopted in
bogofilter.

List of equations

[#1]  Graham's p(w) as currently used in bogofilter
[#2]  Bayes' theorem
[#3]  Graham's p(w) in the notation of [#2]
[#4]  Theoretically correct p(w) if spam/nonspam is the same in the
      training database as in the incoming messages
[#5]  p(w) if spam/nonspam has to be estimated separately in the
      population of incoming messages

Acknowledgement

Thanks go to David Relson for reviewing this document in draft. 
Several of his helpful suggestions have been adopted; of course,
responsibility for any mistakes remains with the author.

+ end of document +

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |