Suggested default parameter values, was Re: Hapax survival over time

Thu Mar 25 14:36:41 CET 2004

On Wed, 2004-03-24 at 13:29, Greg Louis wrote:
> The suggested parameter values are:
> 
> robx=0.52
> min_dev=0.375
> robs=0.0178
> spam_cutoff=0.99
> ham_cutoff=0.45 (or 0 if one prefers binary evaluation)
> 
> and I am suggesting we make those the new defaults in the bogofilter
> distribution.  People might like to try them (adjusting the spam cutoff

I don't see anything glaringly dangerous in those numbers.  The cutoffs
are very conservative from my experience though... I would expect a
fairly large number of unsures and false negatives with those values.  I
regularly classify large volumes of spam in the 0.465 - 0.6 range.  Of
the few unsures (all spam) I'm still getting, many are in the 0.3-0.465
range, and I intend to inch my cutoff downward over time still.  Here
are my current values:

robx=0.46
min_dev=0.2
robs=0.2
spam_cutoff=0.465
ham_cutoff=0.1

Your cutoffs might be good for a brand new database, but after a few
dozen registrations, I would highly suggest moving those cutoffs
downward.  And although I agree with David when he says we want to bias
toward ham with the robx, I've learned that 0.5 is by no means the
ham/spam boundary line, despite intuition to the contrary.  0.52 is not
a dangerous robx unless it is higher than the spam_cutoff and/or lower
than the upper min_dev boundary, neither of which is the case with your
numbers.  Something we might want to consider in light of this though is
allowing the min_dev to be centered somewhere other than 0.5.  Perhaps
at the midway point between your cutoffs would be a more neutral
mid-point.

Interestingly though, despite your conservative cutoffs, your robs is
rather insignificant.  For new databases, I'd suggest a high value, as a
low value such as yours might cause large fluctuations which would be
seen as instability in classifications.  Such a low value may only be
useful with much larger databases where stronger tokens can anchor
classifications more stably.  The robs value to me is a measure of
self-doubt, which only decreases with experience.

BTW, I've been considering something which I'd like some comments on.  I
don't know if anyone else has tried it, but my bfproxy program is quite
stable and working very well for me.  And now I'm thinking of adding
some tuning capability to it.  Or this could be added directly to
bogofilter if desired.  My basic concept is that every correction sent
through bfproxy (if the option is set) will modify the config slightly. 
For instance, unregistering a ham (-N) will cause the ham cutoff to be
decreased by something like 0.001 (let's call it $mod).  Registering an
unsure or ham as spam will cause the spam cutoff to be decreased
similarly, and also maybe increase the robx by a similar amount.  Every
action will decrease robs by a very small amount too, and also decrease
$mod by a very small amount (~0.00001).  The result of this should be
that users can start with a very conservative configuration, and slowly
approach a more aggressive one determined by their own actions, with the
configuration changing less as the database grows.  Any pros/cons?

Tom

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20040325/3e1a5479/attachment.sig>