Bogofilter for general filesystem classification

Sun Sep 14 14:11:06 CEST 2003

On 20030914 (Sun) at 0013:22 -0400, David Relson wrote:
> Ben,
> 
> I'm sure Greg (the writer of bogotune) will comment on this thread
> within a day or two and will be able to give you a statistically more
> valid response than I can.  Until then ...

Well, he was right in predicting I'd comment, wasn't he (see the previous
posting)?  :)

> linear.  There's also the Robinson-GeometricMean algorithm ('-r',
> algorithm=robinson), which is linear (more or less).  Computing the R-F
> algorithm first involves computing the R-GM result and then applying a
> chi square result (Fisher's modification) to evaluate the likelihood
> that the result is spam or ham (given the number of tokens involved). 

Let me put that a bit more accurately, pedant that I am: the
calculation of the spam score ("spamicity") takes place in two steps,
the first of which -- calculating the products of the individual token
P and Q values -- is the same for both the geometric-mean and the
Fisher algorithms.  The second step, in the geometric-mean case, is to
calculate the geometric means (doh) of the P and Q values and combine
them.  The Fisher variant, instead, involves calculating inverse
chi-squared probability values for P and Q by applying what's called
"Fisher's method of combining probabilities".  To quote from the
introduction to an experiment I did to compare the two
(http://www.bgl.nu/bogofilter/fisher.html):

"Fisher noted in the early fifties that if one has a number k of
independent estimates of probability, and the null hypothesis (that
these k values arose by pure chance) is true, then if you sum the
natural logarithms of the k estimates and multiply by -2, the result
will be distributed as chi-squared with 2k degrees of freedom. 
Accordingly, we can use our message's f(w) values to calculate
-2 * sum(ln(f(w))), with twice the number of tokens as the degrees
of freedom, and apply an inverse chi-squared function to obtain the
probability that the message is spam."
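
For anyone who wants to poke at the quoted procedure, here is a minimal
Python sketch of it.  It is an illustration only, not bogofilter's code:
the names chi2q and fisher_combine are made up for this example, and the
closed-form series is the standard one for a chi-squared tail with an
even number of degrees of freedom (which 2k always is).

```python
import math

def chi2q(x2, dof):
    """Upper-tail probability P(X >= x2) for chi-squared with even dof.

    Hypothetical helper for illustration. Uses the closed form
    exp(-m) * sum(m**i / i!, i = 0 .. dof/2 - 1) with m = x2 / 2.
    """
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    term = math.exp(-m)      # i = 0 term; underflows to 0.0 for very large m
    total = term
    for i in range(1, dof // 2):
        term *= m / i        # builds m**i / i! incrementally
        total += term
    return min(total, 1.0)   # guard against rounding slightly above 1

def fisher_combine(probs):
    """Fisher's method: refer -2 * sum(ln p) to chi-squared with 2k dof."""
    stat = -2.0 * sum(math.log(p) for p in probs)  # each p must be in (0, 1]
    return chi2q(stat, 2 * len(probs))

if __name__ == "__main__":
    # Made-up f(w) values, purely to exercise the functions.
    print(fisher_combine([0.99] * 10))
    print(fisher_combine([0.01] * 10))
```

A handy sanity check is that a single estimate comes back unchanged:
with k = 1 the statistic is -2 ln p, and the chi-squared tail with two
degrees of freedom at that point is exactly p.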

BTW that experiment very unexcitingly confirmed the theoretical
expectation: the discrimination of both algorithms depends entirely on
the first step and not at all on the second, so if bogofilter's
parameters are optimally tuned in both cases, the accuracy should be
identical.  We who prefer the Fisher method see two major advantages:
with Fisher, messages that are in fact ambiguous (these are very useful
in training) are clearly distinguishable from the obvious spam or
nonspam, and the spam cutoff value is less critical and more stable
when we apply Fisher's method than when we take geometric means.

Credit-where-credit-is-due department: it was Gary Robinson himself who
proposed the switch to Fisher's method as an improvement over the
geometric-mean algorithm.  The final combination of the two Fisher
probabilities is done in a way suggested by Tim Peters of the Spambayes
project.
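
To make the two steps concrete, here is a rough Python sketch of a
shared first step followed by each second step.  Everything in it is an
assumption for illustration rather than a description of bogofilter's
source: the function names are invented, the geometric-mean score uses
the (P - Q) / (P + Q) form rescaled to 0..1, and the Fisher score uses
the (1 + H - S) / 2 combination as it is commonly described for this
family of filters.  The f(w) values are made up.

```python
import math

def chi2q(x2, dof):
    """Upper-tail chi-squared probability for even dof (illustrative helper)."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def first_step(fw):
    """Shared first step: log-products of f(w) and of 1 - f(w)."""
    ln_spam = sum(math.log(f) for f in fw)        # ln of prod f(w)
    ln_ham = sum(math.log(1.0 - f) for f in fw)   # ln of prod (1 - f(w))
    return ln_spam, ln_ham, len(fw)

def geometric_mean_score(ln_spam, ln_ham, n):
    """Second step, geometric-mean variant (assumed form)."""
    p = 1.0 - math.exp(ln_ham / n)    # high when tokens look spammy
    q = 1.0 - math.exp(ln_spam / n)   # high when tokens look hammy
    return (1.0 + (p - q) / (p + q)) / 2.0

def fisher_score(ln_spam, ln_ham, n):
    """Second step, Fisher variant with the assumed (1 + H - S) / 2 combination."""
    h = chi2q(-2.0 * ln_spam, 2 * n)  # near 1 when tokens look spammy
    s = chi2q(-2.0 * ln_ham, 2 * n)   # near 1 when tokens look hammy
    return (1.0 + h - s) / 2.0

if __name__ == "__main__":
    samples = {
        "obvious":   [0.95] * 20,
        "leaning":   [0.70] * 20,
        "ambiguous": [0.95] * 10 + [0.05] * 10,
    }
    for name, fw in samples.items():
        step1 = first_step(fw)
        print(name, geometric_mean_score(*step1), fisher_score(*step1))
```

Both scoring functions take only the output of first_step, which is the
point made above: whatever discrimination there is comes from the first
step, and the second step only changes how the result is spread over the
0..1 scale.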

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |