tuning bogofilter (was: bogofilter producing poor results)
a1bformk at tinny.soundwave.net
Thu Nov 14 01:07:01 EST 2002
On Tue, Nov 12, 2002 at 03:58:36PM -0500, Greg Louis wrote:
> This is likely to become a FAQ so here's a bit of an explanation that I
> hope may help you and others interested in tuning Gary Robinson's f(w)
> and S calculation. Much of the substance is straight out of Gary's
> paper, but I've tried to emphasize the practical effects:
Greg, thanks again for your assistance. I calculated a more appropriate
value for ROBX (thanks Perl), played around with the other values, and I
think things are really happening now. Using the same data as last
time, I'm down to only 11 false negatives (5% of the spam!) and 2 false
positives; and those false positives were really just spammy-looking
newsletters that I usually trash after a quick glance anyway. If I
continue to see this sort of performance, you can definitely count me as
another happy user.
For your amusement, here are the values that I settled on:
#define MAX_PROB 0.9999f // max probability value used
#define MIN_PROB 0.0001f // min probability value used
#define ROBINSON_MIN_DEV 0.15f // if nonzero, use characteristic words
#define ROBINSON_SPAM_CUTOFF 0.54f // if it's spammier than this...
#define ROBINSON_MAX_REPEATS 1 // cap on word frequency per message
#define ROBS 0.01f // Robinson's s
#define ROBX 0.415f // Robinson's x
My ROBX turned out to be the same as yours, I think, but that's an
honest coincidence. Maybe the default value of 0.200 needs to be
cranked up a bit? Changing this value seemed to have the greatest effect
on the results, followed (strangely enough) by kicking up the MIN_DEV.
These are experimental results on a very small data set, so, as you
said, YMMV.
And of course, as you pointed out, this should have less impact on the
results as the data set grows.
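
For anyone following along, here is a rough sketch of how ROBS and ROBX
enter Robinson's f(w) = (s*x + n*p(w)) / (s + n), where p(w) is the raw
spam probability of a word and n is the number of messages it has been
seen in. This is not bogofilter's actual source: the function names
robinson_fw and is_characteristic are made up for illustration, and the
MIN_DEV test (keep a word only if its f(w) is at least MIN_DEV away from
0.5) is my reading of the "use characteristic words" comment.

```c
#include <assert.h>

/* Values taken from the settings listed above. */
#define ROBS 0.01              /* Robinson's s: weight given to the prior */
#define ROBX 0.415             /* Robinson's x: assumed score of an unseen word */
#define ROBINSON_MIN_DEV 0.15  /* minimum distance from 0.5 to count a word */

/* Hypothetical helper: Robinson's f(w) = (s*x + n*p) / (s + n).
 * p is the raw spam probability for the word, n the number of
 * messages the word has appeared in. */
double robinson_fw(double p, double n)
{
    assert(n >= 0.0);
    return (ROBS * ROBX + n * p) / (ROBS + n);
}

/* Hypothetical helper (assumed behaviour): a word counts toward the
 * message score only if f(w) is at least MIN_DEV away from neutral. */
int is_characteristic(double fw)
{
    double dev = (fw > 0.5) ? fw - 0.5 : 0.5 - fw;
    return dev >= ROBINSON_MIN_DEV;
}
```

With these numbers, a word that has never been seen (n = 0) scores
exactly ROBX, while a word with p(w) = 0.9 seen in a single message
already scores about 0.895. That is one way to see why ROBX matters
most when the word lists are small.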
William Ono <a1bformk at tinny.soundwave.net>
PGP 2048R/93BA6AFD E3 64 C5 43 3E B3 2D A6 C6 D7 E3 45 90 24 78 DE