Graham's method seemed better

Sat Nov 23 18:30:23 CET 2002

On 20021123 (Sat) at 0832:38 -0800, Tim Witham wrote:
> >>>>> "Boris" == Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> writes:
> 
> Boris> Greg Louis <glouis at dynamicro.on.ca> wrote:
> >>> evidence of an effect. But as you know I used bogofilter out
> >>> of the box and did not tweak anything.
> >> 
> >> Rem acu tetigisti.  You can't expect to do that successfully, any more
> >> than you can run a car without using the steering wheel.
> 
> If that is true, then there is at least a bug in the documentation.
> The cutoff is mentioned but says it is set at compile time.  Indeed I
> see no option to tweak it at run-time.  Sounds unimportant.  It's like
> a steering wheel installed in the engine compartment.  You must stop
> and pop the hood to change directions.  Not too user friendly.

I don't think you'd want to specify the cutoff at runtime, except by
way of the configuration file.  You would tweak it once every couple of
weeks, or less often than that as time goes on and training improves.

Robinson-Fisher, as the new variant of Robinson's method is being
called, may be less fussy about the spam cutoff value (though I still
tune mine carefully).  Graham is much less fussy than either, and if
ease is preferred to optimal discrimination, decent results can be
obtained by going that route.  But users who do that must realize it's
a tradeoff, and not complain that Robinson's discrimination is bad when
in fact it's just that they don't wish to take the trouble to tune it.
I guess we should indeed have emphasized this point more when we first
offered Robinson's method.

(Over the next few weeks we expect to look at two more possible
calculation methods, and maybe they will not have this problem either.)

> Boris> But I do right now. And it will be crucial to do so to win new
> Boris> users.

Frankly, I don't care at this point to seek new users on the basis of
ease of use.  My goal is the most effective discrimination possible,
with ease of use something to try for as long as it doesn't conflict
with effective discrimination.  Once I no longer need to worry about
discrimination, yes, ease of use will become more of an interest.

> Or at least clue them in that they need to adjust it.  I didn't know this
> either.  A run-time option and some documentation would be important
> for me where I want to let many users use the same bogofilter binary
> on their different mail.  Maybe stick it in a config file so it
> doesn't need to be specified for each run...

That's the right place for it.

> Wait, I just looked in config.c and there is a "spam_cutoff" so maybe
> it is already there.  But I still can't find the documentation of how
> to use it.

I'll provide a bit of info in my last paragraph, but I want to make
another point first.  That zero at the beginning of the version number
is there for a reason.  Less than a month ago, few of us had any real
understanding of what the various parameters do and how important their
values may be.  We're still learning and still chasing the optimum
calculation method; the documentation necessarily follows at a
distance.  It hurts to write lengthy documentation for functionality
you're going to discard next week or next month.

I guess what I'm saying here is that I don't recommend going into
production with any current Bayesian filter unless you're willing to
read Paul's and Gary's essays, understand what the program does, and
recompile it every once in a while to tune it to your email population.
If you want a finished product with nice complete docs, wait till David
releases 1.0.  (And even then, I don't think it will be a question of
"run this and your spams will magically disappear.")

In the meantime, what you do with spam_cutoff is adjust it until you
have what you deem an acceptably low number of false positives.  Then,
if the false-negative count seems too high, you do two things: lower
the spam_cutoff a bit, and train on all the mistakes.  When the
false-negative count gets low enough, up the spam_cutoff a bit again so
as to reduce false positives.  As an example, about ten days ago I
noticed that my false-negative rate had dropped under 1%; that was the
hint to me to move my Robinson geometric-mean spam_cutoff from 0.54 to
0.55, which resulted in false positives going from 0.4% to about 0.2%
and false negatives climbing back to roughly 1%.  Since I'd like to get
false positives down into the <0.1% range, I could bump the cutoff a
bit higher still; but then three or four spams a day would get through.  If I
train carefully for another couple of weeks, I'll probably be able to
shift the boundary again.  (I've switched to Robinson-Fisher since
then, but the principle's roughly the same.)
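
To make the tuning loop above concrete, here is a rough Python sketch of
the bookkeeping involved.  It is not bogofilter code: the function names,
the candidate cutoffs and the sample scores are all made up for
illustration, and it assumes a message is called spam when its score is
greater than or equal to the cutoff.

```python
def error_rates(scored, cutoff):
    """Return (false_positive_rate, false_negative_rate) at a cutoff.

    scored is a list of (score, is_spam) pairs for hand-checked mail.
    Assumption for this sketch: score >= cutoff means "classified as spam".
    """
    ham = [s for s, is_spam in scored if not is_spam]
    spam = [s for s, is_spam in scored if is_spam]
    false_pos = sum(1 for s in ham if s >= cutoff)   # good mail called spam
    false_neg = sum(1 for s in spam if s < cutoff)   # spam that got through
    fp_rate = false_pos / len(ham) if ham else 0.0
    fn_rate = false_neg / len(spam) if spam else 0.0
    return fp_rate, fn_rate


def lowest_cutoff_for(scored, max_fp_rate, candidates):
    """Return the smallest candidate cutoff with an acceptable FP rate.

    Taking the smallest such cutoff keeps false negatives as low as the
    false-positive target allows.  Returns None if no candidate qualifies.
    """
    for cutoff in sorted(candidates):
        fp_rate, _ = error_rates(scored, cutoff)
        if fp_rate <= max_fp_rate:
            return cutoff
    return None


# Hypothetical hand-checked scores, NOT real measurements.
sample = [
    (0.10, False), (0.40, False), (0.545, False),   # non-spam
    (0.50, True), (0.56, True), (0.90, True),       # spam
]

print(error_rates(sample, 0.54))
print(lowest_cutoff_for(sample, 0.0, [0.54, 0.55, 0.56]))
```

In practice you would feed it the scores of mail you have already
classified by hand, pick the cutoff, put that value in your
configuration, and repeat after the next round of training on the
mistakes.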

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |