Suggested default parameter values, was Re: Hapax survival over time

Greg Louis glouis at dynamicro.on.ca
Thu Mar 25 15:37:15 CET 2004


On 20040325 (Thu) at 0836:41 -0500, Tom Anderson wrote:
> On Wed, 2004-03-24 at 13:29, Greg Louis wrote:
> > The suggested parameter values are:
> > 
> > robx=0.52
> > min_dev=0.375
> > robs=0.0178
> > spam_cutoff=0.99
> > ham_cutoff=0.45 (or 0 if one prefers binary evaluation)
> > 
> > and I am suggesting we make those the new defaults in the bogofilter
> > distribution.  People might like to try them (adjusting the spam cutoff
> 
> I don't see anything glaringly dangerous in those numbers.  In my
> experience, though, the cutoffs are very conservative... I would expect a
> fairly large number of unsures and false negatives with those values.  I
> regularly classify large volumes of spam in the 0.465-0.6 range.  Of
> the few unsures (all spam) I'm still getting, many are in the 0.3-0.465
> range, and I still intend to inch my cutoff downward over time.  Here
> are my current values:
> 
> robx=0.46
> min_dev=0.2
> robs=0.2
> spam_cutoff=0.465
> ham_cutoff=0.1

It's important to take into account that the distribution of message
scores can change drastically with alterations in bogofilter's
parameters.  If you use the new parameter values, you will find that
you need higher cutoffs than before (though your optima may indeed be
different from those I obtained in my test).

> Your cutoffs might be good for a brand new database, but after a few
> dozen registrations, I would highly suggest moving those cutoffs
> downward.

I'd wait longer than that and base any change on actual accuracy, but I
don't disagree with you in principle.  I will just remind you that
those cutoffs were found to be optimal, with those parameters, for
corpora containing well over 50,000 spam and 50,000 nonspam -- hardly
brand new databases.

> And although I agree with David when he says we want to bias
> toward ham with the robx, I've learned that 0.5 is by no means the
> ham/spam boundary line, despite intuition to the contrary.  0.52 is not
> a dangerous robx unless it is higher than the spam_cutoff and/or lower
> than the upper min_dev boundary, neither of which is the case with your
> numbers.  Something we might want to consider in light of this though is
> allowing the min_dev to be centered somewhere other than 0.5.  Perhaps
> the midway point between your cutoffs would be a more neutral
> center.

With the small s values it's moot.
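
For anyone who hasn't met min_dev: it defines a band of token scores,
centered on 0.5, that are simply ignored when the message score is
computed, and Tom's suggestion amounts to moving the center of that
band.  A rough sketch in Python (the "center" argument is hypothetical;
bogofilter itself always centers the band on 0.5):

    # Simplified view of min_dev filtering; the center parameter is
    # hypothetical and only illustrates the suggestion quoted above.
    def contributing_tokens(scores, min_dev=0.375, center=0.5):
        """Keep token scores that deviate from center by at least min_dev."""
        return [f for f in scores if abs(f - center) >= min_dev]

    scores = [0.01, 0.12, 0.48, 0.52, 0.60, 0.97]
    print(contributing_tokens(scores))                              # [0.01, 0.12, 0.97]
    print(contributing_tokens(scores, min_dev=0.2, center=0.2825))  # [0.01, 0.52, 0.6, 0.97]

With min_dev=0.375 only quite strong tokens get through, which
presumably contributes to the score distribution (and hence the
optimal cutoffs) looking so different from yours.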

> Interestingly though, despite your conservative cutoffs, your robs is
> rather insignificant.  For new databases, I'd suggest a high value, as a
> low value such as yours might cause large fluctuations which would be
> seen as instability in classifications.  Such a low value may only be
> useful with much larger databases where stronger tokens can anchor
> classifications more stably.  The robs value to me is a measure of
> self-doubt, which only decreases with experience.

That hasn't been my experience.  I agree with your theory but in
practice -- Gary and I have scratched our heads over this in the past
-- I have never found values of s greater than 0.1 to be as good
with my message population as lower values; neither when the corpus was
small, nor now, when it is a good size.  While I would certainly
encourage people to experiment with larger s as they become familiar
with how bogofilter's parameters operate, I would strongly deprecate
setting a large s value as the default.
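
For readers who want the arithmetic behind that: s and x enter each
token's score through Robinson's smoothing,

    f(w) = (s*x + n*p(w)) / (s + n),

where n is the number of messages the token has appeared in and p(w) is
the raw spamminess estimate.  A quick Python sketch (simplified; the
real p(w) is normalized by the sizes of the spam and ham lists) shows
why s matters a lot for rare tokens and hardly at all for well-attested
ones:

    # Simplified Robinson smoothing; the raw estimate here ignores list
    # sizes, which the real calculation does not.
    def smoothed_score(spam_hits, ham_hits, robs=0.0178, robx=0.52):
        n = spam_hits + ham_hits
        if n == 0:
            return robx                  # unseen token: fall back to the prior
        p = spam_hits / n                # crude raw spamminess estimate
        return (robs * robx + n * p) / (robs + n)

    # A hapax seen once in spam: s decides how far it is pulled toward robx.
    print(smoothed_score(1, 0, robs=0.0178))    # ~0.99
    print(smoothed_score(1, 0, robs=0.2))       # ~0.92
    # A well-attested token: s is nearly irrelevant.
    print(smoothed_score(500, 5, robs=0.0178))  # ~0.99
    print(smoothed_score(500, 5, robs=0.2))     # ~0.99 -- s hardly matters here

With corpora the size of mine, almost every token that carries weight
has n well above 1, so a large s mostly just drags the hapaxes toward
robx; that may be why I have never seen it help here.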

> BTW, I've been considering something which I'd like some comments on.  I
> don't know if anyone else has tried it, but my bfproxy program is quite
> stable and working very well for me.  And now I'm thinking of adding
> some tuning capability to it.  Or this could be added directly to
> bogofilter if desired.  My basic concept is that every correction sent
> through bfproxy (if the option is set) will modify the config slightly. 
> For instance, unregistering a ham (-N) will cause the ham cutoff to be
> decreased by something like 0.001 (let's call it $mod).  Registering an
> unsure or ham as spam will cause the spam cutoff to be decreased
> similarly, and also maybe increase the robx by a similar amount.  Every
> action will decrease robs by a very small amount too, and also decrease
> $mod by a very small amount (~0.00001).  The result of this should be
> that users can start with a very conservative configuration, and slowly
> approach a more aggressive one determined by their own actions, with the
> configuration changing less as the database grows.  Any pros/cons?

Wow.  I've seen a 0.001 change in spam_cutoff make a 10% difference in
the false-negative count.  Good luck!  I think the concept is a very interesting
one, but I also think it'll take a huge amount of effort to build,
debug and tune it to the point where a naïve user can trust it -- if
that's possible at all.  I've always felt that bogotune is at best a
rough guide to start the knowledgeable user on the path to good
parameters, and that human wisdom will always be needed to get them
really right.
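
That said, if I follow your description, the adjustment loop would look
something like the sketch below (Python rather than Perl, and purely
illustrative: the 0.001 and 0.00001 steps are yours, while the starting
values, the size of the robs decrement and the clamping are my
guesses):

    # Sketch of the self-tuning idea as I read it; not bfproxy code.
    class AdaptiveConfig:
        def __init__(self, robx=0.52, robs=0.2, spam_cutoff=0.99,
                     ham_cutoff=0.45, mod=0.001):
            # conservative starting point; robs starts large, per Tom's
            # preference for new databases
            self.robx = robx
            self.robs = robs
            self.spam_cutoff = spam_cutoff
            self.ham_cutoff = ham_cutoff
            self.mod = mod               # the per-correction step, $mod

        def _decay(self):
            # every correction shrinks robs and the step itself;
            # 0.0001 for robs and the 0.01 floor are placeholders --
            # Tom only says "a very small amount"
            self.robs = max(self.robs - 0.0001, 0.01)
            self.mod = max(self.mod - 0.00001, 0.0)

        def unregister_ham(self):        # a correction like bogofilter -N
            self.ham_cutoff = max(self.ham_cutoff - self.mod, 0.0)
            self._decay()

        def register_as_spam(self):      # an unsure or ham was really spam
            self.spam_cutoff = max(self.spam_cutoff - self.mod,
                                   self.ham_cutoff)
            self.robx = min(self.robx + self.mod, self.spam_cutoff)
            self._decay()

Note the scale: with $mod starting at 0.001 and shrinking by 0.00001
per correction, it reaches zero after a hundred corrections, having
moved a cutoff by at most about 0.05 along the way -- and as I said,
I've seen a shift of 0.001 in spam_cutoff change the false-negative
count by 10%.  So the damping would need very careful tuning before
anyone could leave such a mechanism unattended.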

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



