a source of useful info and a question (longish)

Greg Louis glouis at dynamicro.on.ca
Tue Nov 19 13:51:29 CET 2002


On 20021118 (Mon) at 1815:18 -0500, Tim Peters wrote:
> [Greg Louis]
> > ...
> > Both geometric-mean and Fisher-based calculation use
> >   scalefactor = badlist_messagecount / goodlist_messagecount
> > in bogofilter.  In spambayes, if the goodlist has more tokens,
> >   scalefactor = badlist_tokencount / goodlist_tokencount
> > and otherwise
> >   scalefactor = goodlist_tokencount / badlist_tokencount
> 
> nspam and nham in the spambayes code are actually the # of msgs of each kind
> trained on, not the # of tokens they contain.

Oops, missed that.  In that case we do the same, except that we always
scale to the size of the spam list, even if it's bigger -- the spambayes
scale-to-smaller approach is probably a good idea, but I doubt it makes
much difference unless the db is horribly lopsided.
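In code, the two conventions as described look something like this (a
sketch only, using message counts per your correction; not the actual
bogofilter or spambayes source):

  # bogofilter: always scale the good counts to the spam side
  scalefactor = spam_message_count / ham_message_count

  # spambayes-style: scale toward whichever corpus is smaller
  if ham_message_count > spam_message_count:
      scalefactor = spam_message_count / ham_message_count
  else:
      scalefactor = ham_message_count / spam_message_count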
 
> > We calculate, for each token from w = 1 to n where n is the number of
> > unique tokens in the message,
> >   f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
> > and spambayes does something essentially the same.
> 
> I don't think so, but they reduce to the same if nspam == nham.  If nspam is
> much larger than nham, I believe they can be quite different.  I'm not sure
> how your formula was derived.  The spambayes one starts with the pure
> by-counting spamprob estimate
> 
>     p(w) = spamratio(w) / (spamratio(w) + hamratio(w))
> where
>     spamratio = (# of spam w appears in)/(total # of spam msgs)
>     hamratio  = (# of ham w appears in)/(total # of ham msgs)
> 
> and then does a Bayesian adjustment to that
> 
>     f(w) = (s*x + n*p(w)) / (s+n)

> Your f(w) has a partly similar
> form, but it's unclear (to me) what posterior probability it's an adjustment
> *of*.

I think they're identical.  Given that n is badcount + goodcount,
n * p(w) is (badcount + goodcount) * badcount / (badcount + goodcount),
which resolves to badcount.  Similarly, your p(w) and ours are the
same, ours just being yours multiplied top and bottom by nspam.
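A quick numeric check of that identity (hypothetical counts; here
goodcount is taken as already multiplied by scalefactor):

  s, x = 1.0, 0.5
  badcount, goodcount = 7.0, 3.0

  n = badcount + goodcount
  p_w = badcount / (badcount + goodcount)

  f_spambayes  = (s * x + n * p_w) / (s + n)
  f_bogofilter = (s * x + badcount) / (s + badcount + goodcount)

  assert abs(f_spambayes - f_bogofilter) < 1e-12  # identical, as argued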

> > In the geometric-mean calculation,
> >   P = 1 - exp(sum(ln(1-f(w))) / n)
> >   Q = 1 - exp(sum(ln(f(w))) / n)
> >   S = (1 + (P-Q) / (P+Q)) / 2
> > where S is the "spamicity" value.
> 
> The frexp() trick in spambayes gary_spamprob() can be used to save all the
> ln() calls (simulating unbounded dynamic float range).

Yes, I thought that was kewl.
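For anyone following along, the idea is to keep the running product of
the f(w) (or 1-f(w)) values as a frexp mantissa/exponent pair, so it
never underflows, and take the nth root at the end instead of averaging
logs.  A rough Python sketch of the geometric-mean combining (not the
actual spambayes or bogofilter code):

  from math import frexp

  def nth_root_of_product(values, n):
      # Track the product as mantissa * 2**exponent so that
      # multiplying thousands of values in (0, 1) never underflows.
      mant, exp = 1.0, 0
      for v in values:
          mant, e = frexp(mant * v)
          exp += e
      # (mant * 2**exp) ** (1/n)  ==  mant**(1/n) * 2**(exp/n)
      return mant ** (1.0 / n) * 2.0 ** (exp / n)

  def spamicity(probs):
      # P, Q, S exactly as in the formulas quoted above
      n = len(probs)
      P = 1.0 - nth_root_of_product((1.0 - f for f in probs), n)
      Q = 1.0 - nth_root_of_product(probs, n)
      return (1.0 + (P - Q) / (P + Q)) / 2.0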

> > Fisher's method uses an inverse chi-squared function, prbx, to get the
> > probability associated with -2 times the sum of the logs of f(w) with
> > 2n degrees of freedom:
> >   P = prbx(-2 * sum(ln(1-f(w))), 2*n)
> >   Q = prbx(-2 * sum(ln(f(w))), 2*n)
> >   S = (1 + Q - P) / 2
> 
> Just noting that prbx(x, v) is the probability that a random variable
> following a chi-squared distribution with v degrees of freedom has value >=
> x.  It's the "right tail" area under the chi-squared density curve.
> 
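For concreteness, prbx has a simple closed form when the degrees of
freedom are even, as they always are here (2n).  A Python sketch, not
bogofilter's actual implementation:

  from math import exp, log

  def prbx(x2, df):
      # P(X >= x2) for a chi-squared variate with even df:
      #   exp(-x2/2) * sum over i in [0, df/2) of (x2/2)**i / i!
      assert df % 2 == 0
      m = x2 / 2.0
      term = exp(-m)   # i = 0 term; a real implementation guards
      total = term     # against underflow for very large x2
      for i in range(1, df // 2):
          term *= m / i
          total += term
      return min(total, 1.0)

  # and then, as quoted above:
  #   P = prbx(-2 * sum(log(1 - f) for f in probs), 2 * len(probs))
  #   Q = prbx(-2 * sum(log(f) for f in probs), 2 * len(probs))
  #   S = (1 + Q - P) / 2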
> > the discriminatory power of Robinson's method, if
> > forced to decide between spam and not spam with no third choice
> > available, is the same whether geometric mean or Fisher-combining is
> > used in the final step.
> 
> Since the geomean method is monotonic with Fisher's, this may even be
> expected.  The pragmatic problem our testers found was that specifying an
> equally good cutoff for geomean required more than two decimal digits of
> precision, varied across peoples' test data, varied within a single corpus
> with amount of training data, and varied within a single corpus when
> changing other system options.

I agree wholeheartedly with all those findings.

> In contrast, chi cutoff values aren't touchy
> at all, if you're willing to take very good results instead of optimal
> results; one digit of precision is usually plenty, and the spambayes
> defaults of ham_cutoff=.20 and spam_cutoff=.90 work pretty well for
> everyone.  With very large training sets, I push them to .30 and .80.  BTW,
> the asymmetry here seems due to that, in test after test, the spam score
> distribution is tighter than the ham score distribution.

Chi values seem just as touchy if you're forcing binary decisions
and/or varying the minimum deviation.  For example, in the experiment I
mentioned, I had to go to a cutoff of 0.952 to get chi to give the same
false-positive count, with minimum deviation 0.1, as the geometric mean
gave with a cutoff of 0.583 and the same minimum deviation; rounded to
0.95 and 0.58 respectively, the counts diverged slightly.

> > What is not so clear is whether bogofilter ought to adopt the ternary
> > approach: divide messages into spam, nonspam and unknown and feed
> > nonspam and unknown to the user with X-Bogosity set to No or Uncertain.
> > I can see arguments both ways.  What do you think?
> 
> The evidence (my impression from my own email) says that when chi is
> unsure, it
> doesn't do better than chance; if so, a method monotonic with it won't do
> better either.
> 
> If you bite that bullet, confessing up-front that sometimes the method is
> plain lost, then the value of chi over geomean is that it's much easier to
> tell when it's lost.

Since Sunday I've become convinced that this is so, and is a
considerable advantage.  I've got my production versions of bogofilter
running the Fisher-based (chi combination) method now, despite saying a
few days ago that I wasn't going to do so any time soon :)  The thing
that turned the scale was that discrimination power is the same, so why
not get the extra benefits?  (I still have to look hard at throughput,
but for my mail traffic and server muscle there's no problem with chi.)
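The decision rule itself is trivial once you have S; something like
this, using the spambayes default cutoffs you mentioned (a sketch, not
bogofilter's actual option handling):

  def bogosity(S, ham_cutoff=0.20, spam_cutoff=0.90):
      if S >= spam_cutoff:
          return "Yes"        # X-Bogosity: Yes
      if S <= ham_cutoff:
          return "No"         # X-Bogosity: No
      return "Uncertain"      # X-Bogosity: Uncertain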

> The *median* spam score under chi is 1 (to 6
> significant digits), and the median ham socre is 0 (ditto).  It's hard to
> mistake those for "I'm confused" <wink>.  We'd rather tell users when we
> believe the method has broken down.  Users who can't deal with that can
> pretend we said "ham", and curse us for the spam that ends up there.

Just so.

> And it turns out that training on Unsures alone leads to a very
> effective classifier, but one more prone to Unsures than a
> heavily trained classifier (there are so few unsures that
> unsure-driven training ends up relying (IMO) too much on hapaxes).

Training a substantial (>5000 msg) db on _mistakes_ alone (defined
in the binary sense) improves performance impressively quickly too.
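By training on mistakes I mean something like the following loop, where
score() and train() are hypothetical stand-ins for the real interfaces:

  for msg, is_spam in labeled_messages:
      predicted_spam = score(msg) >= cutoff   # binary, no Uncertain
      if predicted_spam != is_spam:
          train(msg, is_spam)   # only misclassified msgs update the db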

(BTW that was the first new nontech word I've learned in quite some
time - "hapax" - a word or form occurring only once in a document or
corpus, according to Webster.  Came into English from Greek in 1882,
apparently. ;-)

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



