tuning bogofilter (was: bogofilter producing poor results)

Thu Nov 14 07:07:01 CET 2002

On Tue, Nov 12, 2002 at 03:58:36PM -0500, Greg Louis wrote:
> This is likely to become a FAQ so here's a bit of an explanation that I
> hope may help you and others interested in tuning Gary Robinson's f(w)
> and S calculation.  Much of the substance is straight out of Gary's
> paper, but I've tried to emphasize the practical effects:

Greg, thanks again for your assistance.  I calculated a more appropriate
value for ROBX (thanks Perl), played around with the other values, and I
think things are really happening now.  Using the same data as last
time, I'm down to only 11 false negatives (5% of the spam!) and 2 false
positives; and those false positives were really just spammy-looking
newsletters that I usually trash after a quick glance anyway.  If I
continue to see this sort of performance, you can definitely count me as
another happy user.

For your amusement, here are the values that I settled on:

#define MAX_PROB        0.9999f         // max probability value used
#define MIN_PROB        0.0001f         // min probability value used
#define ROBINSON_MIN_DEV        0.15f   // if nonzero, use characteristic words
#define ROBINSON_SPAM_CUTOFF    0.54f   // if it's spammier than this...
#define ROBINSON_MAX_REPEATS    1       // cap on word frequency per message
#define ROBS                    0.01f   // Robinson's s
#define ROBX                    0.415f  // Robinson's x
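
For anyone wondering how ROBS and ROBX enter the calculation, here is a minimal sketch of Gary Robinson's f(w) = (s*x + n*p(w)) / (s + n), plus a deviation test in the spirit of ROBINSON_MIN_DEV.  The function names robinson_f and is_characteristic are made up for illustration and are not bogofilter's own code, and applying the deviation test to f(w) is an assumption about how the cutoff is used.

```c
#include <assert.h>

// Values from the message above, as doubles for this sketch.
#define ROBS              0.01   // Robinson's s: weight given to the prior x
#define ROBX              0.415  // Robinson's x: assumed probability for unseen words
#define ROBINSON_MIN_DEV  0.15   // minimum distance from 0.5 for a word to count

// Robinson's f(w): blend the prior x (weight s) with the observed
// spam probability p (weight n, the number of times the word was seen).
double robinson_f(double p, double n, double s, double x)
{
    assert(s + n > 0.0);            // avoid dividing by zero
    return (s * x + n * p) / (s + n);
}

// Hypothetical helper: returns 1 if f is at least min_dev away from
// the neutral value 0.5, and 0 otherwise.
int is_characteristic(double f, double min_dev)
{
    double dev = f - 0.5;
    if (dev < 0.0)
        dev = -dev;
    return dev >= min_dev;
}
```

With n = 0 the result is simply x, and with s this small a single observation already pulls f(w) most of the way to p(w).  Under the assumption above, a never-seen word at x = 0.415 sits 0.085 from 0.5 and would be skipped with a minimum deviation of 0.15.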

My ROBX turned out to be the same as yours, I think, but that's an
honest coincidence.  Still, maybe the default value of 0.200 needs to be
cranked up a bit?  I think changing ROBX had the greatest effect on the
results, followed (strangely enough) by raising ROBINSON_MIN_DEV.
These are experimental results on a very small data set, so, as you
said, YMMV.  And of course, as you pointed out, ROBX should have less
impact on the results as the data set grows.
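
One way to arrive at a ROBX value is to average p(w) over the words that have been seen often enough to trust.  The sketch below does that in C.  It is not the Perl script mentioned above, which isn't shown here; the struct, the function names, the min_count threshold and the fallback value are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical per-word counts: messages containing the word in each corpus.
struct token_count {
    unsigned long spam;
    unsigned long good;
};

// Robinson's p(w) = b / (b + g), where b and g are the fractions of
// spam and good messages that contain the word.
double robinson_p(unsigned long spam, unsigned long good,
                  unsigned long nspam, unsigned long ngood)
{
    assert(nspam > 0 && ngood > 0 && spam + good > 0);
    double b = (double)spam / (double)nspam;
    double g = (double)good / (double)ngood;
    return b / (b + g);
}

// Estimate x as the mean p(w) over words seen at least min_count times.
// Returns fallback when no word qualifies.
double estimate_robx(const struct token_count *tokens, size_t ntokens,
                     unsigned long nspam, unsigned long ngood,
                     unsigned long min_count, double fallback)
{
    double sum = 0.0;
    size_t used = 0;

    for (size_t i = 0; i < ntokens; i++) {
        unsigned long total = tokens[i].spam + tokens[i].good;
        if (total == 0 || total < min_count)
            continue;               // too rare to say anything about
        sum += robinson_p(tokens[i].spam, tokens[i].good, nspam, ngood);
        used++;
    }
    return used > 0 ? sum / (double)used : fallback;
}
```

On a small corpus few words pass the threshold, so the estimate can move a lot from one data set to the next.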

--
William Ono <a1bformk at tinny.soundwave.net>
PGP 2048R/93BA6AFD E3 64 C5 43 3E B3 2D A6    C6 D7 E3 45 90 24 78 DE