-r versus -f [was: train on error]

Mon Sep 8 10:25:59 CEST 2003

On 6 Sep 2003 at 1:08, jxz wrote:

> I suppose the Robinson-Fisher algorithm is used by default because it
> produced better results in practical tests.

That's right.

> But the Robinson algorithm produces a linear scale, as you say, and it
> is interesting, because we can look at the "real spaminess" that
> bogofilter thinks about the message, in a linear fashion, rather than a
> "distorted" function tending to 0 or 1.

I could be wrong, but I believe the *ordering* is the same for Robinson vs 
Robinson-Fisher, but as you say, Robinson-Fisher is a distorted function 
with most of the values concentrated near 0 and 1.
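
To make the "distorted" shape concrete, here is a rough Python sketch of the
two combining rules as I understand them from Gary Robinson's published
description. It is not bogofilter's actual code: the function names are made
up for illustration, and it assumes each token already has a spam probability
strictly between 0 and 1. It only shows how the two scores differ for the
same inputs; it does not prove anything about the ordering.

```python
import math


def chi2q(x2, dof):
    """Chi-square survival function P(X >= x2), valid for even dof only."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def robinson_geometric(probs):
    """Robinson geometric-mean score, rescaled to the range 0..1."""
    n = len(probs)
    p = 1.0 - math.exp(sum(math.log(1.0 - x) for x in probs) / n)
    q = 1.0 - math.exp(sum(math.log(x) for x in probs) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0


def robinson_fisher(probs):
    """Fisher (chi-square) combining of the same token probabilities."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - x) for x in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(x) for x in probs), 2 * n)
    return (1.0 + s - h) / 2.0


if __name__ == "__main__":
    # Ten hypothetical tokens, all with the same per-token spam probability.
    for p in (0.1, 0.4, 0.5, 0.6, 0.9):
        probs = [p] * 10
        print(p, robinson_geometric(probs), robinson_fisher(probs))
```

With ten identical tokens the geometric-mean score simply comes out as the
per-token probability, while the Fisher score is pushed much harder towards
0 or 1 as soon as the tokens lean one way.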

> The question is: Is there any general rule for "what algorithm should I
> use?" other than practical experience with my own email?

I think the general rule is to use the default settings and algorithm 
first, as this is based on quite extensive testing. Of course it might not 
suit all conditions (and I used Robinson to help me do something different 
from normal).  

I think the performance depends more on the training than on the 
algorithm. From your last email, your results (13% false negatives after 
2400 spam and 6000 ham) are not as good as I would expect. Personally, I 
would suspect that your spam and ham corpora are not "clean" (some spam 
mixed into the ham corpus and vice versa).

You might do better with a few hundred manually checked messages than with 
a few thousand unchecked messages. I definitely would not use the -u flag 
to auto-update the database, as it is too easy to "pollute" the database. 

-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk