tuning bogofilter (was: bogofilter producing poor results)
William Ono
a1bformk at tinny.soundwave.net
Thu Nov 14 07:07:01 CET 2002
On Tue, Nov 12, 2002 at 03:58:36PM -0500, Greg Louis wrote:
> This is likely to become a FAQ so here's a bit of an explanation that I
> hope may help you and others interested in tuning Gary Robinson's f(w)
> and S calculation. Much of the substance is straight out of Gary's
> paper, but I've tried to emphasize the practical effects:
Greg, thanks again for your assistance. I calculated a more appropriate
value for ROBX (thanks Perl), played around with the other values, and I
think things are really happening now. Using the same data as last
time, I'm down to only 11 false negatives (5% of the spam!) and 2 false
positives; and those false positives were really just spammy-looking
newsletters that I usually trash after a quick glance anyway. If I
continue to see this sort of performance, you can definitely count me as
another happy user.
For your amusement, here are the values that I settled on:
#define MAX_PROB             0.9999f // max probability value used
#define MIN_PROB             0.0001f // min probability value used
#define ROBINSON_MIN_DEV     0.15f   // if nonzero, use characteristic words
#define ROBINSON_SPAM_CUTOFF 0.54f   // if it's spammier than this...
#define ROBINSON_MAX_REPEATS 1       // cap on word frequency per message
#define ROBS                 0.01f   // Robinson's s
#define ROBX                 0.415f  // Robinson's x
My ROBX turned out to be the same as yours, I think; that's an honest
coincidence. Still, maybe the default value of 0.200 needs to be
cranked up a bit? I think changing this value had the greatest effect
on the results, followed (strangely enough) by kicking up the MIN_DEV.
These are experimental results on a very small data set, so, as you said, YMMV.
And of course, as you pointed out, this should have less impact on the
results as the data set grows.
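
To make that last point concrete, here is a minimal sketch of Robinson's
f(w) = (s*x + n*p(w)) / (s + n) from Gary's paper. The function name
robinson_fw and its parameters are made up for illustration and are not
bogofilter's actual code: p is the raw spam probability estimated from
the word's counts, and n is the number of times the word has been seen
in training.

```c
#include <assert.h>

#define ROBS 0.01   /* Robinson's s: how much weight the prior x gets */
#define ROBX 0.415  /* Robinson's x: assumed score for an unseen word */

/* Hypothetical helper: blend the prior x with the observed p(w).
 * p: raw spam probability estimated from the word's counts
 * n: number of times the word has been seen in training */
double robinson_fw(double p, double n, double s, double x)
{
    assert(s + n > 0.0);  /* avoid dividing by zero */
    return (s * x + n * p) / (s + n);
}
```

With n = 0 the result is exactly x, so ROBX alone decides the score of a
never-seen word. As n grows, the s*x term is swamped by n*p(w) and f(w)
approaches p(w), which is why ROBX should matter less and less as the
training set gets bigger.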
--
William Ono <a1bformk at tinny.soundwave.net>
PGP 2048R/93BA6AFD E3 64 C5 43 3E B3 2D A6 C6 D7 E3 45 90 24 78 DE