tuning bogofilter (was: bogofilter producing poor results)
Greg Louis
glouis at dynamicro.on.ca
Thu Nov 14 13:18:37 CET 2002
Glad the tuning worked so well for you! Just one caveat:
On 20021113 (Wed) at 2207:01 -0800, William Ono wrote:
>
> For your amusement, here are the values that I settled on:
>
> #define MAX_PROB 0.9999f // max probability value used
> #define MIN_PROB 0.0001f // min probability value used
These aren't used in the Robinson calculation.
I'm interested in the MIN_DEV effect, and intend to run a major
experiment to characterize it. I've been holding off because Gary
Robinson has come up with another suggestion for calculating S values,
namely, to calculate -2 * sum(ln(1-f(w))) and -2 * sum(ln(f(w))) and
treat them as values of chi-squared with 2n degrees of freedom, where n
is the number of f(w) values used in the sums. These are fed into an
inverse chi-squared function, giving probabilities P and Q
respectively, and then S = (1 + Q - P)/2. The spambayes project has
been claiming yet better results with this than with the
logarithmic-mean calculation Gary originally proposed (which is what
bogofilter uses). In my hands, however, initial tests have been quite
disappointing; after tuning both to give 18 false positives out of 4512
nonspams, I got 98 false negatives in 1594 spams with logarithmic mean
and 264 false negatives on the same 1594 spams with chi-squared. Gary
and I have been corresponding about this but it looks as though, with
chi-squared, there's a largeish middle ground where the calculation is
telling us that the training is inadequate to support a decision. This
is theoretically interesting but the logarithmic-mean approach is
guessing right a large proportion of the time, even (for me anyway)
with MIN_DEV set to zero. Quite possibly a non-zero MIN_DEV would make
a difference to the chi-squared outcome; I must look into that!
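
For anyone who wants to try the chi-squared calculation described above, here is a rough sketch in Python. It is not bogofilter's or spambayes' actual code. It assumes that the "inverse chi-squared function" is the upper-tail probability of the chi-squared distribution (which has a closed form for an even number of degrees of freedom), and that MIN_DEV means "ignore words whose f(w) is within MIN_DEV of 0.5". The names chi2q, chi_squared_s and min_dev are made up for illustration.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of chi-squared; dof must be even and positive."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    m = x2 / 2.0
    # Closed form for even dof: exp(-m) * sum(m**i / i!) for i in 0 .. dof/2 - 1
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1
    return min(total, 1.0)


def chi_squared_s(fw_values, min_dev=0.0):
    """Combine per-word f(w) values (each strictly between 0 and 1) into S."""
    # Assumed MIN_DEV behaviour: drop words too close to the neutral 0.5
    used = [f for f in fw_values if abs(f - 0.5) >= min_dev]
    n = len(used)
    if n == 0:
        return 0.5  # nothing to go on
    p_stat = -2.0 * sum(math.log(1.0 - f) for f in used)
    q_stat = -2.0 * sum(math.log(f) for f in used)
    p = chi2q(p_stat, 2 * n)  # 2n degrees of freedom
    q = chi2q(q_stat, 2 * n)
    return (1.0 + q - p) / 2.0
```

With this sketch, five words at f(w) = 0.99 give S very close to 1, while one word at 0.99 and one at 0.01 give S of about 0.5, which is the sort of "can't decide" result mentioned above. For very long messages exp(-m) can underflow to zero, so a real implementation would need to handle that.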
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg at bgl.nu |