-r versus -f [was: train on error]
Peter Bishop
pgb at adelard.com
Mon Sep 8 10:25:59 CEST 2003
On 6 Sep 2003 at 1:08, jxz wrote:
> I suppose the Robinson-Fisher algorithm is used by default because it
> produced better results in practical tests.
That's right.
> But the Robinson algorithm produces a linear scale, as you say, and it
> is interesting, because we can look at the "real spaminess" that
> bogofilter thinks about the message, in a linear fashion, rather than a
> "distorted" function tending to 0 or 1.
I could be wrong, but I believe the *ordering* of messages is the same for
Robinson and Robinson-Fisher. As you say, though, Robinson-Fisher is a
distorted function with most of its values concentrated near 0 and 1.
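To make the "linear versus distorted" point concrete, here is a minimal
sketch in Python. It is not bogofilter's code. It assumes the commonly
described forms of the two combining rules (Robinson's geometric-mean
formula and Fisher's inverse chi-square method), and the function names
and the sample token probabilities are made up for illustration.

```python
import math


def robinson_score(probs):
    """Robinson's geometric-mean combination (assumed form).

    probs: per-token spam probabilities, each strictly between 0 and 1.
    """
    n = len(probs)
    # P is 1 minus the geometric mean of (1 - p); Q is 1 minus the
    # geometric mean of p.
    p_ind = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    q_ind = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    return (1.0 + (p_ind - q_ind) / (p_ind + q_ind)) / 2.0


def chi2q(x2, dof):
    """Upper tail of the chi-square distribution for an even dof."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(probs):
    """Fisher (inverse chi-square) combination (assumed form)."""
    n = len(probs)
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_ind - ham_ind + 1.0) / 2.0


if __name__ == "__main__":
    # Hypothetical messages: ten tokens each, all with the same probability.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        tokens = [p] * 10
        print(p, robinson_score(tokens), fisher_score(tokens))
```

With identical token probabilities the Robinson score simply returns that
probability, while the Fisher score is pushed towards 0 or 1 as soon as the
tokens lean one way. Both give 0.5 for a perfectly neutral message.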
> The question is: Is there any general rule in "what algorithm should I
> use?" other than practical experience with my own email?
I think the general rule is to use the default settings and algorithm
first, as this is based on quite extensive testing. Of course it might not
suit all conditions (and I used Robinson to help me do something different
from normal).
I think the performance is more dependent on the training than the
algorithm. From your last email, your results (13% false negatives after
training on 2400 spam and 6000 ham messages) are not as good as I would
expect. Personally, I would suspect that your spam and ham corpora are not
"clean" (some spam mixed in with the ham corpus and vice versa).
You might do better with a few hundred manually checked messages than with
a few thousand unchecked ones. I definitely would not use the -u flag to
auto-update the database, as it is too easy to "pollute" the database.
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk