Robinson-Fisher (aka chi) beats Bayes chain rule

Greg Louis glouis at dynamicro.on.ca
Sat Nov 30 02:39:57 CET 2002


A detailed writeup of this experiment will appear sometime during the
coming weekend at http://www.bgl.nu/~glouis/bogofilter/FisherBcr.html
but for now here's a summary:

An experiment was performed to compare the discriminatory power of
bogofilter using the Robinson-Fisher method (Robinson's f(w) word
scores, with the "spamicity" index calculated by applying Fisher's
chi-squared combination of probabilities) against that obtained when
Graham's p(w) or Robinson's f(w) values are fed into the Bayesian
chain rule.
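
For concreteness, here's a minimal Python sketch of the ingredients as
Gary Robinson describes them; the function names are mine, and
bogofilter's real code differs in many details (tokenization,
clamping, tunable parameters), so treat this as illustration only:

  from math import exp, log

  def chi2Q(x2, v):
      # Survival function P(X >= x2) of a chi-squared variate with an
      # even number v of degrees of freedom (closed form for even v).
      m = x2 / 2.0
      term = total = exp(-m)
      for i in range(1, v // 2):
          term *= m / i
          total += term
      return min(total, 1.0)

  def graham_pw(bad, good, nbad, ngood):
      # Graham's p(w), simplified: how spam-flavored a word looks,
      # given its counts in the spam (bad of nbad messages) and
      # nonspam (good of ngood) training corpora.  Graham's
      # count-doubling and clamping are omitted here.
      b, g = bad / nbad, good / ngood
      return b / (b + g)

  def robinson_fw(pw, n, s=1.0, x=0.5):
      # Robinson's f(w): shrink p(w) toward the prior x with strength
      # s when the word has been seen only n times in training.
      return (s * x + n * pw) / (s + n)

  def fisher_spamicity(fws):
      # The Robinson-Fisher ("chi") index: Fisher-combine the f(w)
      # values twice, once as spam evidence and once as ham evidence.
      n = len(fws)
      S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - f) for f in fws), 2 * n)
      H = 1.0 - chi2Q(-2.0 * sum(log(f) for f in fws), 2 * n)
      return (1.0 + S - H) / 2.0  # 1 = spam, 0 = nonspam, 0.5 = unsure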

My code to implement the Bayesian chain rule has not yet been
published, but will appear as an appendix to the writeup at the above
URL.  It is directly derived from the implementation in crm114, version
crm114-2002.11.26.  See http://crm114.sourceforge.net if you're
curious.
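
The chain rule itself is just iterated Bayes updating under an
independence assumption.  Here's a sketch of the textbook form
(crm114's actual code adds weighting and other refinements that I'm
not reproducing):

  def bayes_chain_rule(probs, prior=0.5, eps=1e-6):
      # Fold per-token spam probabilities into P(spam) one at a time:
      #   P' = P*p / (P*p + (1-P)*(1-p))
      # Clamping p avoids a division by zero when a token probability
      # is exactly 0 or 1.
      p_spam = prior
      for p in probs:
          p = min(max(p, eps), 1.0 - eps)
          num = p_spam * p
          p_spam = num / (num + (1.0 - p_spam) * (1.0 - p))
      return p_spam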

For this experiment I accumulated 6372 spam and 6372 nonspam messages,
carefully screened to eliminate duplicates and misclassified messages.
Each corpus was distributed into four files: a training file containing
2124 messages, and three test files containing 1416 messages each.  The
distribution was done as you'd deal a deck of cards, like this:
  cat distrib
  #!/bin/bash
  # usage: FILENO=0 formail -s ./distrib [bad|good] <corpus.[bad|good]
  # formail increments FILENO once per message it splits out; dealing
  # round-robin through these nine slots sends 3/9 of the corpus to
  # the training file t and 2/9 to each of the test files r0, r1, r2.
  FILE=(t.$1 r1.$1 r2.$1 t.$1 r0.$1 r1.$1 t.$1 r2.$1 r0.$1)
  let n=${FILENO}%9
  fname=${FILE[$n]}
  cat >>$fname

A special version of bogofilter was prepared.  It has all the functions
of normal bogofilter, except that of the built-in methods it supports
only Robinson-Fisher; in addition, it evaluates spamicity with the
Bayesian chain rule, both on the raw Graham p(w) values and on Gary
Robinson's f(w) modification.  This was trained with the spam and
nonspam training files mentioned above, and then run against each of
the three nonspam test files and each of the three spam test files.

Experience has shown that the best way to compare different algorithms
is the following: establish an optimum level of false
positives (nonspams reported to be spam) for the reference algorithm
(in this case Robinson-Fisher), and adjust cutoff values so the test
algorithms (in this case Bayes Chain Rule with p(w) or f(w) as inputs)
deliver the same number of false positives; then compare the false
negatives (spams reported to be nonspam) for the algorithms under test.
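
Mechanically, the matching step looks like this (a hypothetical
helper, not part of bogofilter; ties among scores are ignored):

  def fneg_at_matched_fpos(ham_scores, spam_scores, n_fpos):
      # Pick the cutoff that lets exactly n_fpos nonspam messages
      # score at or above it, then count the spams that fall below.
      cutoff = sorted(ham_scores, reverse=True)[n_fpos - 1]
      fneg = sum(1 for s in spam_scores if s < cutoff)
      return cutoff, fneg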

The Bayes chain rule (Bcr), in my hands at least (and there may perhaps
be ways I haven't thought of to overcome this), is headstrong about its
opinions.  When run with the normal spam_cutoff value, the
Robinson-Fisher method produced 49, 40 and 39 false positives for the
three nonspam test files.  There was no corresponding spam_cutoff value
for Bcr-pw or Bcr-fw (the p(w) and f(w) inputs, respectively); the
lowest count of false positives available with Bcr-pw was 127, and the
lowest count available with Bcr-fw was 84 (roughly double the average
of the Robinson-Fisher counts).  I tentatively concluded that the Bcr
method is too
dogmatic about its decisions; however, I decided to compare performance
of the Robinson-Fisher and Bcr-fw methods with spam_cutoff values set
to produce 84 false positives.

(Before I go on, let me explain graphically what I mean by "too
dogmatic."  Both algorithms produce values of the "spamicity" index
between 0 and 1, 0 being nonspam and 1 being spam.  If you plot the
distributions of these values for a large number of messages, you will
see something like this:

Robinson-geometric-mean  Robinson-Fisher  Bayes Chain Rule
           ---                  ----              --------
          /                    /                  |
         /                    |                   |
        /                     |                   |
       /                      |                   |
      /                       |                   |
     /                       /                    |
  ---                    ----              --------

All three methods have an area of doubt, within which the
discrimination between spam and nonspam is uncertain.  With the
Robinson-geometric-mean model, spams return values ranging from about
0.45 to 1, and nonspams return values ranging from 0 to about 0.55;
there's a grey area where they overlap and a decision is difficult.
With Robinson-Fisher, the grey area is more clearly delineated: most
spams score very close to 1, most nonspams are close to zero, and
anything from about 0.1 to 0.9 clearly belongs in the grey area. 
Nevertheless it's possible, with this algorithm, to choose a spam
cutoff value somewhere in the range of 0.93 to 0.98 that gives quite
strong discrimination with relatively few errors.  With the Bayes chain
rule, from what I've seen, the grey area practically disappears; if a
mistake is made, the algorithm leaves almost no room to express doubt. 
The Bayes chain rule is said to learn extremely fast when trained on
its errors, so that it quickly reaches a state where very few errors
are made; but it offers little or no indication that, "hey, this one is
an oddball, I'm not sure but I think..." -- it just issues a dogmatic
classification.)
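
The sketches above make "dogmatic" easy to demonstrate: give both
combiners twenty tokens that are each only mildly spam-flavored, and
the chain rule multiplies the odds by 1.5 per token, roughly 3300:1 in
all, while the Fisher combination stays visibly unsure:

  mild = [0.6] * 20
  print(bayes_chain_rule(mild))    # about 0.9997: near-certain spam
  print(fisher_spamicity(mild))    # about 0.69: leans spam, doubtful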

What I got in my test was the following:
    calc run fneg percent
1    chi   0   23    1.62
2    chi   1   17    1.20
3    chi   2   24    1.69
4 bcr-fw   0   27    1.91
5 bcr-fw   1   22    1.55
6 bcr-fw   2   27    1.91

In this table, the fneg column reports the actual counts of false
negatives (out of 1416) when the spam cutoff is set to deliver 84 false
positives.  The percent column gives the same results as percentages.

We can calculate the means and 95% confidence limits of these values,
and they come out like this:

    calc meanfnpc lcl95 ucl95
1    chi     1.51  1.45  1.56
2 bcr-fw     1.79  1.73  1.85

It turns out that the second of the three test spam files is easier to
classify than either the first or the third; the analysis takes this
into account, so the effect of run differences is excluded from the
above table.  These same results are plotted graphically in the
attached .png file; the vertical bars show the 95% confidence limits.
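
As a cross-check on the size of the effect, a simple paired analysis
of the per-run percentages points the same way; this is only a sketch,
not the analysis that produced the table above:

  chi = [1.62, 1.20, 1.69]                    # per-run fneg %, chi
  bcr = [1.91, 1.55, 1.91]                    # per-run fneg %, bcr-fw
  diffs = [b - c for b, c in zip(bcr, chi)]   # pairing removes the
  n = len(diffs)                              # run-to-run effect
  mean = sum(diffs) / n
  var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
  halfwidth = 4.303 * (var / n) ** 0.5   # t(0.975, 2 d.f.) = 4.303
  print(f"bcr-fw minus chi: {mean:.2f} +/- {halfwidth:.2f} points")
  # -> about 0.29 +/- 0.16 percentage points; the interval excludes 0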

What we see, then, is that the Bayesian chain rule not only forces us
to accept a higher level of false positives than is seen with the
Robinson-Fisher method when the latter is "optimally" tuned, but also
delivers more false negatives when the false-positive counts are tuned
to match.  It follows from this that the Robinson-Fisher method may be
easier to adapt to our application, though it's conceivable that
someone who understands the Bayesian chain rule more deeply than I do
could tune it to perform better still.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
[Attachment: bcr2.png, image/png, 2681 bytes --
<http://www.bogofilter.org/pipermail/bogofilter/attachments/20021129/8a020bf3/attachment.png>]

