classification calculation method -- time to make Fisher the default?

Mon Jan 13 14:18:33 CET 2003

I just wrote a bit of an explanation of how I go about supporting
multiple users of one bogofilter training set.  In that context I
explained why Robinson-Fisher is the calculation method of choice for
me.  This raised two questions I'd been thinking about anyway:

1.  Of those that use bogofilter in production, how many are on
Robinson-GM and how many on Robinson-Fisher?  Is anyone still using the
original Graham calculation?  Of those on Robinson-GM, how many have
tried Fisher?  For those of you who did, why didn't you switch?  Was it
because Fisher was worse, or just no better, in your hands?

2.  Can we please make Fisher the default at the next release?  It's no
big deal (I always change my copy to do that anyway), except that
defaulting to the geometric-mean evaluation sends new bogofilter users
in what I believe is the wrong direction.

Lemme explain that.  Unsurprisingly (given that they both depend on the
same f(w) calculation as input), the two Robinson methods have
identical discrimination power; anyone who sets the operating
parameters optimally will get equally good binary discrimination (spam
vs nonspam) with either.  Where the Robinson-Fisher approach excels is
that it makes very clear those cases where the decision is uncertain;
that is, where there are strong hints in both directions and the
classification is a rough guess.  Most spams (over 90% with a
well-trained db) get "spamicity" values (S) very very close to 1 when
classified with Robinson-Fisher; similarly, for a high proportion of
nonspams, S will be below 0.1.  Anything between 0.1 and about 0.99 is
"uncertain;" when deciding in the binary sense, these are deemed
nonspam, but this group is likely to contain some spams that will be
delivered as false negatives.
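
For anyone who wants to see the two combining rules side by side, here
is a minimal Python sketch.  It assumes the formulas as Gary Robinson
published them (an inverse chi-square combination for Fisher, a
geometric-mean combination for GM); the function names are made up for
illustration and this is not bogofilter's actual code.  The input is a
list of per-token f(w) values, each strictly between 0 and 1.

```python
import math


def chi2q(x2, df):
    """Upper-tail chi-square probability; valid for even df only."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1, so clamp it.
    return min(total, 1.0)


def fisher_spamicity(fw):
    """Robinson-Fisher: combine the f(w) values via inverse chi-square."""
    n = len(fw)
    # p is tiny when the tokens are collectively spammy,
    # q is tiny when they are collectively nonspammy.
    p = chi2q(-2.0 * sum(math.log(1.0 - f) for f in fw), 2 * n)
    q = chi2q(-2.0 * sum(math.log(f) for f in fw), 2 * n)
    return (1.0 + q - p) / 2.0


def gm_spamicity(fw):
    """Robinson-GM: combine the f(w) values via geometric means."""
    n = len(fw)
    p = 1.0 - math.exp(sum(math.log(1.0 - f) for f in fw) / n)
    q = 1.0 - math.exp(sum(math.log(f) for f in fw) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0


if __name__ == "__main__":
    strong = [0.99] * 15              # every token points to spam
    mixed = [0.99] * 8 + [0.01] * 8   # strong hints in both directions
    for name, fw in (("strong", strong), ("mixed", mixed)):
        print(name, fisher_spamicity(fw), gm_spamicity(fw))
```

With fifteen tokens all at 0.99, the Fisher value is essentially 1,
while the GM value is only 0.99; with evenly mixed evidence both rules
land at about 0.5.  The first case is what makes a fixed 0.99 cutoff
workable with Fisher.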

Why does it matter?  It matters for training.  If you want a db which
is optimally compact and yet optimally effective, feed the first 10,000
nonspams through bogofilter -n and the first 10,000 spams through
bogofilter -s; thereafter, train with any actual false positives you
encounter, plus any actual false negatives -- and here's the real win
-- plus all of the uncertains, separated into spams and nonspams by
human inspection.  (The first 10,000 nonspams and 10,000 spams have to
be identified by human inspection as well -- bogofilter's -u option, if
used, still requires that someone check every single email and back out
the errors with -S and -N.  That's a pig of a job, and switching to
just known mistakes plus uncertains saves a huge amount of time.)
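
The selection rule for ongoing training can be sketched in a few lines
of Python.  This is only an illustration: training_flag is a
hypothetical helper, the cutoffs of 0.1 and 0.99 are assumed, and the
true class (is_spam) still has to come from human inspection.  The
return value is the bogofilter option to train with, or None if the
message was classified correctly and confidently and can be skipped.

```python
HAM_CUTOFF = 0.1    # assumed nonspam cutoff
SPAM_CUTOFF = 0.99  # assumed spam cutoff


def training_flag(spamicity, is_spam):
    """Return "-s", "-n", or None for a message whose true class is known."""
    confident_spam = spamicity >= SPAM_CUTOFF
    confident_ham = spamicity < HAM_CUTOFF
    if is_spam and not confident_spam:
        return "-s"  # false negative, or an uncertain that is really spam
    if not is_spam and not confident_ham:
        return "-n"  # false positive, or an uncertain that is really nonspam
    return None      # right and confident: nothing to train on
```

Only messages for which this returns a flag need to be fed back into
the training db.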

Why else might it matter?  I wrote above that if you set the operating
parameters optimally, you can get equally good discrimination between
spam and nonspam with Robinson-Geometric-Mean (the current default).
What I didn't mention is that it's a lot easier to get the parameters,
specifically the spam cutoff, right with Robinson-Fisher.  With
Robinson-GM the spam cutoff is very very critical; changes in the third
place of decimals make a noticeable difference.  Not many people will
have the patience to fiddle-and-test till it's right, and anyway, it
will need retuning after a bit more training has been done.  With
Robinson-Fisher, I quickly settled on 0.1 and 0.99 for my cutoff values
(nonspam and spam respectively) and have had no need to change them
since, even though my training db has grown by 40% in the interval.
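
As a rough sketch of how two cutoffs turn a spamicity value into a
three-way verdict, assuming the 0.1 and 0.99 values mentioned above
(classify and deliver_as_spam are hypothetical names, and how the exact
boundary values are treated is my own choice here):

```python
def classify(spamicity, ham_cutoff=0.1, spam_cutoff=0.99):
    """Map a spamicity value to "spam", "nonspam", or "unsure"."""
    if spamicity >= spam_cutoff:
        return "spam"
    if spamicity < ham_cutoff:
        return "nonspam"
    return "unsure"


def deliver_as_spam(spamicity, spam_cutoff=0.99):
    """Binary decision: everything below the spam cutoff is delivered."""
    return spamicity >= spam_cutoff
```

In the binary sense an "unsure" message is treated as nonspam, so only
the spam cutoff affects delivery; the nonspam cutoff just marks which
delivered messages deserve a second look.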

Anyone who would like some hard experimental results to back up the
assertions I've made above is invited to take a look at
http://www.bgl.nu/bogofilter and the experiments to which that page has
links, especially fisher.html, training2.html and param.html.  I'll
gladly discuss this further, on or off line as appropriate.

Most people who start using bogofilter start with the default algorithm
and the default parameters.  Let's point them in the right direction in
future.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |