tuning bogofilter (was: bogofilter producing poor results)

Greg Louis glouis at dynamicro.on.ca
Thu Nov 14 13:18:37 CET 2002


Glad the tuning worked so well for you!  Just one caveat:

On 20021113 (Wed) at 2207:01 -0800, William Ono wrote:
> 
> For your amusement, here are the values that I settled on:
> 
> #define MAX_PROB        0.9999f         // max probability value used
> #define MIN_PROB        0.0001f         // min probability value used

These aren't used in the Robinson calculation.

I'm interested in the MIN_DEV effect, and intend to run a major
experiment to characterize it.  I've been holding off because Gary
Robinson has come up with another suggestion for calculating S values,
namely, to calculate -2 * sum(ln(1-f(w))) and -2 * sum(ln(f(w))) and
treat them as values of chi-squared with 2n degrees of freedom, where n
is the number of f(w) values used in the sums.  These are fed into an
inverse chi-squared function, giving probabilities P and Q
respectively, and then S = (1 + Q - P)/2.  The spambayes project has
been claiming yet better results with this than with the
logarithmic-mean calculation Gary originally proposed (which is what
bogofilter uses).  In my hands, however, initial tests have been quite
disappointing; after tuning both to give 18 false positives out of 4512
nonspams, I got 98 false negatives in 1594 spams with logarithmic mean
and 264 false negatives on the same 1594 spams with chi-squared.  Gary
and I have been corresponding about this but it looks as though, with
chi-squared, there's a largeish middle ground where the calculation is
telling us that the training is inadequate to support a decision.  This
is theoretically interesting but the logarithmic-mean approach is
guessing right a large proportion of the time, even (for me anyway)
with MIN_DEV set to zero.  Quite possibly a non-zero MIN_DEV would make
a difference to the chi-squared outcome; I must look into that!

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



