Robinson's s parameter

Greg Louis glouis at dynamicro.on.ca
Fri Apr 18 15:22:23 CEST 2003


David and I have been trying for some time now to understand why
varying Robinson's s over the range 0.001 to 1e-8 can have a
significant effect on the spam scores of some messages.  David found a
message whose score was 0.999 at s=0.001 and 0.505 at s=1e-8.  He
pointed me in the right direction (I had suspected a different cause),
and with that hint, comparing the tables printed by bogofilter -R at
these two s values made it very clear what's going on:

Small values of s give disproportionate weight to tokens that appear
only in one or the other of the two wordlists in the training database.

This is because, if a token occurs only in the goodlist, p(w) will be
0, and if one occurs only in the spamlist, p(w) will be 1.  In either
case, the f(w) calculation
   fw = (s * x + n * pw) / (s + n)
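gives a value whose distance from 0 (when pw is 0) is s*x/(s+n), and
whose distance from 1 (when pw is 1) is s*(1-x)/(s+n).  For a given n
(how often the token has been seen) and x (the score assumed for a
token with no history), that distance depends solely on the magnitude
of s.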
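
To make this concrete, here is a small Python sketch of the f(w)
formula.  The values x = 0.5 and n = 1 are assumptions picked for
illustration (x being the score assumed for a token with no history, n
the number of times the token has been seen); they are not taken from
David's message, and robinson_fw is just a name made up for this sketch.

```python
def robinson_fw(pw, n, s, x=0.5):
    """Robinson's f(w) = (s*x + n*pw) / (s + n)."""
    return (s * x + n * pw) / (s + n)

# Hypothetical token seen once (n = 1) in only one of the wordlists.
for s in (1e-1, 1e-3, 1e-8):
    good_only = robinson_fw(pw=0.0, n=1, s=s)  # token only in the goodlist
    spam_only = robinson_fw(pw=1.0, n=1, s=s)  # token only in the spamlist
    print(f"s={s:g}  f(w) good-only={good_only:.3g}  "
          f"1-f(w) spam-only={1 - spam_only:.3g}")
```

With x = 0.5 the two cases are mirror images, and in both the gap
between f(w) and the extreme value shrinks along with s.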

It follows that s should probably not be allowed to go below about
0.01, to avoid distorting the "spamicity" calculations for messages
with tokens that appear in only one or the other of the wordlists.
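
A rough way to see the size of the distortion, assuming the combining
step multiplies f(w) values together (equivalently, sums their
logarithms), as Robinson's geometric-mean and Fisher-based methods do:
compare -ln f(w) for a one-sided token with that of an ordinary token.
The token values here (n = 1, x = 0.5, and an "ordinary" f(w) of 0.2)
are invented for illustration, and the helper names are hypothetical.

```python
import math

def robinson_fw(pw, n, s, x=0.5):
    """Robinson's f(w) = (s*x + n*pw) / (s + n)."""
    return (s * x + n * pw) / (s + n)

def log_weight(fw):
    """Magnitude of ln f(w): what one token adds to a sum of logs."""
    return -math.log(fw)

# Assumed "ordinary" token that appears in both wordlists.
ordinary = log_weight(0.2)

# Hypothetical goodlist-only token (pw = 0) seen once (n = 1).
for s in (1e-1, 1e-2, 1e-3, 1e-8):
    one_sided = log_weight(robinson_fw(0.0, 1, s))
    print(f"s={s:g}: one-sided token weighs about "
          f"{one_sided / ordinary:.1f}x an ordinary token")
```

The ratio keeps growing as s shrinks, which is the disproportionate
weight described above.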

Bogofilter currently defaults to an s value of 0.001, which is not too
extreme, but should probably be raised.  Note that changing s will
likely change the distribution of scores, so you'll need to twiddle the
spam_cutoff value too.

The experiment written up at http://www.bgl.nu/bogofilter/smindev3.html
points to a quite high minimum deviation (0.44) and an s value of
around 0.1 as giving good results with several different corpora of
email.  Be warned: this is new information, ymmv.  In particular, if
your training database has relatively few (under 5000) spams or
nonspams in it, a minimum deviation that high may not work well for
you; that hasn't been tested yet.
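
For anyone unsure what the minimum deviation does, here is a minimal
sketch, assuming it means that tokens whose f(w) lies within that
distance of 0.5 are left out of the score calculation.  The token
scores are made up, significant_tokens is a hypothetical helper, and
whether the real comparison is >= or > is also an assumption.

```python
def significant_tokens(scores, min_dev):
    """Keep only tokens whose f(w) is at least min_dev away from 0.5."""
    return {tok: fw for tok, fw in scores.items()
            if abs(fw - 0.5) >= min_dev}

# Invented f(w) values for four tokens.
scores = {"viagra": 0.98, "meeting": 0.03, "free": 0.80, "the": 0.51}

kept = significant_tokens(scores, min_dev=0.44)
print(sorted(kept))
```

At a minimum deviation of 0.44 only tokens with f(w) at or below 0.06,
or at or above 0.94, take part, which is why a small training database
might leave too few usable tokens.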

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



