a source of useful info and a question (longish)

Greg Louis glouis at dynamicro.on.ca
Mon Nov 18 19:22:39 CET 2002


I've downloaded spambayes to take a look at how that project's
classification scheme works.  The classifier.py file is very nicely
commented, and contains a lot of valuable discussion.  Tim Peters, who
wrote the comments, has kindly given me permission to quote some of my
favourites here.

As background, here's a quick summary of the Robinson calculation
method in bogofilter-0.8.0 (geometric-mean) and that used in spambayes,
which we're introducing into cvs as a compile-time alternative.  The
latter is based on Fisher's method of combining probabilities, which is
described in the first comment quoted below.

Both geometric-mean and Fisher-based calculation use
  scalefactor = badlist_messagecount / goodlist_messagecount
in bogofilter.  In spambayes, if the goodlist has more tokens,
  scalefactor = badlist_tokencount / goodlist_tokencount
and otherwise
  scalefactor = goodlist_tokencount / badlist_tokencount

We calculate, for each token w = 1 to n, where n is the number of
unique tokens in the message,
  f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
and spambayes does something essentially the same.
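
In Python-ish terms (my own sketch, not code from either project; the
default values for s and x below are just illustrative), the per-token
calculation, with scalefactor computed as above, looks like this:

  def spamprob_f(badcount, goodcount, scalefactor, s=1.0, x=0.5):
      # f(w) = (s*x + badcount) / (s + badcount + goodcount*scalefactor)
      return (s * x + badcount) / (s + badcount + goodcount * scalefactor)

  # e.g. a token seen 5 times in spam, once in nonspam, equal-sized lists:
  spamprob_f(5, 1, 1.0)   # about 0.79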

In the geometric-mean calculation,
  P = 1 - exp(sum(ln(1-f(w))) / n)
  Q = 1 - exp(sum(ln(f(w))) / n)
  S = (1 + (P-Q) / (P+Q)) / 2
where S is the "spamicity" value.
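
A sketch of that combining step in Python (mine, and it assumes every
f(w) is strictly between 0 and 1 so the logs are defined):

  import math

  def geometric_mean_spamicity(fw):      # fw = list of f(w) values
      n = len(fw)
      P = 1.0 - math.exp(sum(math.log(1.0 - f) for f in fw) / n)
      Q = 1.0 - math.exp(sum(math.log(f) for f in fw) / n)
      return (1.0 + (P - Q) / (P + Q)) / 2.0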

Fisher's method uses an inverse chi-squared function, prbx, to get the
probability associated with -2 times the sum of the logs of f(w) (for Q)
or of 1-f(w) (for P), with 2n degrees of freedom:
  P = prbx(-2 * sum(ln(1-f(w))), 2*n)
  Q = prbx(-2 * sum(ln(f(w))), 2*n)
  S = (1 + Q - P) / 2
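
Here's a rough Python sketch of that step (again mine, not spambayes or
bogofilter code); chi2Q stands in for prbx, using the series expansion
for the chi-squared survival function that holds when the degrees of
freedom are even, which 2n always is:

  import math

  def chi2Q(x2, v):
      # prob(chi-squared with v degrees of freedom >= x2); v must be even
      assert v % 2 == 0
      m = x2 / 2.0
      s = term = math.exp(-m)
      for i in range(1, v // 2):
          term *= m / i
          s += term
      return min(s, 1.0)

  def fisher_spamicity(fw):              # fw = list of f(w) values
      n = len(fw)
      P = chi2Q(-2.0 * sum(math.log(1.0 - f) for f in fw), 2 * n)
      Q = chi2Q(-2.0 * sum(math.log(f) for f in fw), 2 * n)
      return (1.0 + Q - P) / 2.0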

Now here are some of Tim's instructive comments.  I particularly liked
this one, which explains Fisher's method of combining probabilities and
describes a situation David Relson encountered early in testing it:

# Across vectors of length n, containing random uniformly-distributed
# probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution
# with 2*n degrees of freedom.  This has been proven (in some
# appropriate sense) to be the most sensitive possible test for
# rejecting the hypothesis that a vector of probabilities is uniformly
# distributed.  Gary Robinson's original scheme was monotonic *with*
# this test, but skipped the details.  Turns out that getting closer
# to the theoretical roots gives a much sharper classification, with
# a very small (in # of msgs), but also very broad (in range of scores),
# "middle ground", where most of the mistakes live.  In particular,
# this scheme seems immune to all forms of "cancellation disease": if
# there are many strong ham *and* spam clues, this reliably scores
# close to 0.5.  Most other schemes are extremely certain then -- and
# often wrong.

AFAICS the above is one reason the spambayes folk think chi-combining
(the Fisher method) is better than the geometric mean.  The next
comment adds to the explanation of this reason.  It explains part of
why the Fisher-based spamicity is based on Q - P, scaled into [0, 1]
(their S is our Q and their H is our P):

# How to combine these into a single spam score?  We originally
# used (S-H)/(S+H) scaled into [0., 1.], which equals S/(S+H).  A
# systematic problem is that we could end up being near-certain
# a thing was (for example) spam, even if S was small, provided
# that H was much smaller.
# Rob Hooft stared at these problems and invented the measure
# we use now, the simpler S-H, scaled into [0., 1.].
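
To make that concrete, here's a made-up pair of numbers (mine, purely
for illustration): suppose a message is a weak match to both the spam
and ham models, but S is much larger than H.

  S, H = 1e-3, 1e-5
  S / (S + H)           # about 0.99: the old rule is near-certain of spam
  (S - H + 1.0) / 2.0   # about 0.50: Rob Hooft's rule says "can't tell"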

And there's this one, which addresses much more authoritatively than I
could the issue of MAX_REPEATS that's come up twice in the last week or
so on this list:

# NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
# n>1 times in a single message, training added n to the word's hamcount
# or spamcount, but predicting scored words only once.  Tests showed
# that adding only 1 in training, or scoring more than once when
# predicting, hurt under the Graham scheme.
# This isn't so under Robinson's scheme, though:  results improve
# if training also counts a word only once.  The mean ham score decreases
# significantly and consistently, ham score variance decreases likewise,
# mean spam score decreases (but less than mean ham score, so the spread
# increases), and spam score variance increases.
# I (Tim) speculate that adding n times under the Graham scheme helped
# because it acted against the various ham biases, giving frequently
# repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
# adding only once in training, a word like that was simply ignored until
# it appeared in 5 distinct training hams.  Without the ham-favoring
# biases, though, and never ignoring words, counting n times introduces
# a subtle and unhelpful bias.
# There does appear to be some useful info in how many times a word
# appears in a msg, but distorting spamprob doesn't appear a correct way
# to exploit it.
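
In other words (a hypothetical sketch of mine, not either project's
actual training code), training under the Robinson scheme does better
if each message bumps a word's count at most once:

  from collections import Counter

  spamcounts = Counter()

  def train_spam(tokens):
      for tok in set(tokens):      # set(): each word counts once per message
          spamcounts[tok] += 1

  train_spam(["viagra", "viagra", "viagra", "cheap"])
  spamcounts["viagra"]             # 1, not 3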

NB: bogofilter with --enable-robinson-fisher is using the same
classification method as spambayes, except that spambayes scales to the
number of tokens, not the number of messages, in the smaller of goodlist
and spamlist.  Might be worth trying that.

Nota benissime: spambayes folks are getting false-negative rates under
1%, with training sets of around 4000 nonspams and 2700 spams; but this
may be because they segregate unknowns and exclude them from the count. 
Of course, they also get very low false-negative rates with very large
(14000 spams and 20000 nonspams) training sets.  An experiment I did
this weekend (http://www.bgl.nu/~glouis/bogofilter/fisher.html) showed
quite clearly that the discriminatory power of Robinson's method, if
forced to decide between spam and not spam with no third choice
available, is the same whether geometric mean or Fisher-combining is
used in the final step.

What is not so clear is whether bogofilter ought to adopt the ternary
approach: divide messages into spam, nonspam and unknown and feed
nonspam and unknown to the user with X-Bogosity set to No or Uncertain. 
I can see arguments both ways.  What do you think?
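
If we did go ternary, the decision step might look something like this
(cutoff values purely illustrative, not a proposal):

  def x_bogosity(S, ham_cutoff=0.20, spam_cutoff=0.90):
      if S >= spam_cutoff:
          return "Yes"
      if S <= ham_cutoff:
          return "No"
      return "Uncertain"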

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |




