-r versus -f [was: train on error]

Greg Louis glouis at dynamicro.on.ca
Sat Sep 6 14:09:28 CEST 2003


On 20030906 (Sat) at 0108:08 -0300, jxz wrote:
> 
> I suppose the Robinson-Fisher algorithm is used by default because it
> produced better results via practical tests.
> 
> But the Robinson algorithm produces a linear scale, as you say, and it
> is interesting, because we can look at the "real spaminess" that
> bogofilter thinks about the message, in a linear fashion, rather than a
> "distorted" function tending to 0 or 1.

Robinson's original geometric-mean algorithm and the Fisher variant are
exactly identical in discrimination capability (how could it be
otherwise when they differ only in how the individual token estimates
are combined?).  If a user finds that one seems to do better than the
other, that user's got one or both mistuned.
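
To make "differ only in how the individual token estimates are
combined" concrete, here is a minimal Python sketch of the two
combining steps, following Robinson's published formulas as I
understand them.  It is not bogofilter's actual code: the function
names are made up for illustration, and the per-token estimates are
assumed to be already computed and to lie strictly between 0 and 1.

```python
import math

def geometric_mean_score(probs):
    """Geometric-mean combining of per-token spam estimates (sketch)."""
    n = len(probs)
    # Work in log space so that long messages do not underflow.
    p = 1.0 - math.exp(sum(math.log(1.0 - f) for f in probs) / n)
    q = 1.0 - math.exp(sum(math.log(f) for f in probs) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0

def chi2q(x2, df):
    """P(chi-square >= x2) for an even number of degrees of freedom."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(probs):
    """Fisher (chi-square) combining of the same estimates (sketch)."""
    n = len(probs)
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - f) for f in probs), 2 * n)
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(f) for f in probs), 2 * n)
    return (1.0 + spam_ind - ham_ind) / 2.0

tokens = [0.9] * 10  # hypothetical per-token estimates
print(geometric_mean_score(tokens), fisher_score(tokens))
```

Both functions take the same list of token estimates; only the final
combining step differs.  With ten tokens all at 0.9, the geometric
mean stays at about 0.9, while the Fisher score is pushed above 0.99.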

The linear scale produced by the geometric-mean algorithm results in
the program being harder to tune correctly; it is sensitive to
extremely small changes in the value of the spam cutoff, and the
optimal cutoff value changes as the training database changes, so it
needs frequent retuning to stay accurate.  The sigmoid scale produced
by applying Fisher's method has the benefit that it permits accurate
identification of real "uncertain" messages, and these have proven to be
valuable in training.  The sigmoid curve reflects _more_ accurately
than the linear one what bogofilter "thinks" about the message; it
expresses clearly that a given message is "almost certainly nonspam",
"almost certainly spam", or "truly ambiguous", given the current
training database.
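
The three-way reading of a sigmoid score can be sketched as a pair of
cutoffs.  The parameter names and the values 0.10 and 0.95 below are
purely illustrative assumptions, not recommended settings; the right
values depend on your own training database.

```python
def classify(score, ham_cutoff=0.10, spam_cutoff=0.95):
    """Map a combined score to one of three verdicts (illustrative cutoffs)."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "nonspam"
    # Everything in between is treated as truly ambiguous; such
    # messages are the candidates for further training.
    return "unsure"
```

Because Fisher scores cluster near 0 and 1, the middle band tends to
hold only the messages that really are ambiguous, so the exact cutoff
values matter less than they do on the linear scale.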

Since the underlying discrimination capabilities are identical, it
doesn't matter in terms of accuracy which of the two you choose, as
long as you tune it very carefully; FWIW, I argued for Fisher to become
the default because it's easier to tune, needs less frequent tuning,
and makes bogofilter easier to train.

There was very extensive discussion about all this on the Spambayes
mailing list, I think around November/December of last year.  Anyone
interested in exploring the subject further should probably check the
Spambayes archive.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



