Minimum deviation and Robinson's s parameter

Greg Louis glouis at dynamicro.on.ca
Sat Dec 21 17:02:37 CET 2002


In bogofilter we can set a minimum deviation from 0.5 (min_dev): tokens
whose Robinson f(w) values lie closer to 0.5 than this are considered
too uncharacteristic to be included in the calculation.  Several of us
have found that setting min_dev to around 0.1 or 0.15 instead of 0
improves discrimination slightly.  We can also set Robinson's s
parameter, which determines the weight given to x, the prior guess at
p(w), when counts for a given token are small.
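
For anyone who wants to see the two knobs side by side, here is a
minimal sketch in Python (not bogofilter's own C code).  It assumes
Robinson's usual formula f(w) = (s*x + n*p(w)) / (s + n), where n is
the number of times the token has been seen.  The function names, the
default values and the ">=" comparison at the cutoff are my own
assumptions for illustration.

```python
def robinson_f(p_w, n, s=1.0, x=0.5):
    """Blend the prior guess x with the observed p(w).

    n is the token's count; s is the weight given to x.  With n == 0
    the result is x, and as n grows it approaches p(w).
    """
    return (s * x + n * p_w) / (s + n)


def keep_token(f_w, min_dev=0.1):
    """True if f(w) deviates from 0.5 by at least min_dev.

    Whether the boundary itself is kept (>= versus >) is an assumption.
    """
    return abs(f_w - 0.5) >= min_dev
```

With p(w) = 0.9 and n = 2, s = 1 gives an f(w) of about 0.77, which
passes a min_dev of 0.1; s = 10 pulls it down to about 0.57, which
does not.  That is one simple way in which the two parameters
interact.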

These two parameters interact in a non-obvious way to affect
bogofilter's error rate.  See
  http://www.bgl.nu/bogofilter/param.html
for an explanation of the calculations and an exploration of the
effects.  There are some pretty 3D graphs... ;-)

For those who just want the bottom line, here's the general conclusion:

"Random choice of parameters like min_dev and s, or choice based on
limited experience, or blind use of the defaults that come with the
bogofilter distribution, is not likely to give optimum discrimination
between spams and nonspams.  Tuning is required, and is likely to be
required again from time to time as bogofilter training improves."

I'm not far enough along this line of investigation to draft or
contribute to a tuning HOWTO.  What I can suggest is that creating a
smallish experimental set with spams and nonspams that haven't been
used for training (I took 20 spams and 20 nonspams), and playing with
the min_dev and s values, isn't a bad way to get initial estimates.  It
helps to vary the parameters over a wide range, though; as the
experiment reported at the above URL shows, there are local minima in
the error rate to watch for.
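
A rough sketch of that kind of sweep, again in Python.  Everything
here is hypothetical: count_errors stands for whatever you use to run
your test messages through bogofilter with a given min_dev and s and
count the misclassifications, and the two value lists are only
examples of a wide range, not recommendations.

```python
import itertools

# Example ranges only (assumptions): s is spaced by factors of ten so
# that the sweep covers several orders of magnitude.
MIN_DEVS = [0.0, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4]
S_VALUES = [0.001, 0.01, 0.1, 1.0, 10.0]


def grid_search(count_errors, min_devs=MIN_DEVS, s_values=S_VALUES):
    """Try every (min_dev, s) pair and return the results, best first.

    count_errors(min_dev, s) is a caller-supplied function returning
    the number of misclassified test messages for that pair.  Each
    result is a tuple (errors, min_dev, s).
    """
    results = [
        (count_errors(min_dev, s), min_dev, s)
        for min_dev, s in itertools.product(min_devs, s_values)
    ]
    return sorted(results)
```

Looking at the whole sorted list, rather than only the first entry,
makes it easier to spot a local minimum that sits well away from the
best pair.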

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



