Suggested default parameter values, was Re: Hapax survival over time

Fri Mar 26 15:32:06 CET 2004

On Thu, 2004-03-25 at 09:37, Greg Louis wrote:
> It's important to take into account that the distribution of message
> scores can change drastically with alterations in bogofilter's
> parameters.  If you use the new parameter values, you will find that
> you need higher cutoffs than before (though your optima may indeed be
> different from those I obtained in my test).

I started my database with the old defaults and have slowly made tweaks
in both directions with all of my values.  My existing set is the best
result of that process so far.  Are you saying that if I started with a
different set of defaults, my configuration would converge to a
completely different set of values?  If that is true, we need some
chaoticians in here to explain it ;)

> I'd wait longer than that and base any change on actual accuracy, but I
> don't disagree with you in principle.  I will just remind you that
> those cutoffs were found to be optimal, with those parameters, for
> corpora containing well over 50,000 spam and 50,000 nonspam -- hardly
> brand new databases.

I'm quite certain that a spam cutoff of 0.99 will create far too many
false negatives, and if this value were to be used as a default, it
would have to be decreased very quickly for anyone to get the kind of
results they would expect (>80% filtration).  I find it very unlikely
that hams would ever surpass 0.5, so my spam_cutoff default would be in
the 0.6-0.8 range, but still decreased after a few dozen to a few
hundred registrations.  But I haven't tested your numbers from scratch,
so I could be wrong.

> With the small s values it's moot.

It's not moot because the min_dev value plays a large role even with a
small s.  Centering the range at a point between the cutoffs instead of
0.5 would mean that the area where you expect a token to be undecidable
coincides with the range of scores in which you'd expect an email to be
undecidable.  Granted these two concepts don't wholly overlap, but that
may be part of the problem.  If I receive an email containing a single
word, then it would be expected that the email score would equal the
token score.

As it is now, however, I've seen claims of tiny min_devs (~0.01)
centered around 0.5, but cutoffs nowhere near 0.5 (~0.1-0.2).  This
makes little sense.  In such a situation, 0.5 has no meaning whatsoever,
which is why a very small min_dev is required (to make 0.5 meaningless
as it actually is).  This may also be why your test showed such a tiny
robs value... because your other values made robx meaningless.

Well I'm of the opinion that robx and min_dev are very important
concepts that we want to take advantage of.  And I believe I am doing so
with my current values.  This means a significant robs and min_dev value
are required and useful.  If you do not use them correctly though, you
will of course get better results by minimizing their effect, which is
precisely what you and some others on this list appear to be doing. 
Perhaps some assumptions are made in bogotune which lead to this?

> That hasn't been my experience.  I agree with your theory but in
> practice -- Gary and I have scratched our heads over this in the past
> -- I have never found values of s greater than 0.1 to be as good
> with my message population as lower values; neither when the corpus was
> small, nor now, when it is a good size.  While I would certainly
> encourage people to experiment with larger s as they become familiar
> with how bogofilter's parameters operate, I would strongly deprecate
> setting a large s value as the default.

It may be that your entrentched theory is entirely wrong and that you
are trying to bend values to fit your theory.  I've noticed that you've
maintained that train-on-error and training-to-exhaustion is entirely
useless (or at least that you personally have found no positive effect)
whereas I and others on this list have found it very useful.  In this
same vein, I have found that increasing robs has been useful.  Now, I
don't want to assume that I'm doing something right and you're doing
something wrong, as I maintain that I am likely the more amateur in
regard to the statistical math going on here, but I'm starting to wonder
why your results differ so profoundly.  It seems to me that if you have
not found significant values of robs to be practical, then you are
misusing robx such that minimizing the affect of this misuse is useful.

> Wow.  I've seen a 0.001 change in spam_cutoff make a 10% difference in
> the fn count.  Good luck!  I think the concept is a very interesting

Ok, well 0.001 was an arbitrary number which I figured would have very
little effect.  A smaller one may indeed be necessary.  The point is to
very slowly inch numbers into the correct direction such that no single
move causes major change, but that over time, the overall effect will be
more aggressive filtering in the correct range of values.  You might
call this, to coin a term, tune-on-error.

> one, but I also think it'll take a huge amount of effort to build,
> debug and tune it to the point where a naïve user can trust it -- if

This is exactly the point though.

> that's possible at all.  I've always felt that bogotune is at best a
> rough guide to start the knowledgeable user on the path to good
> parameters, and that human wisdom will always be needed to get them
> really right.

That may work for you.  It may work for a few geeks and sysadmins out
there.  But it doesn't work for me and the naive masses.  In order for
bogofilter (or anything for that matter) to be widely useful is if it is
easy to use without reading a manual.

Think about a car for example.  Most people are taught to drive by
someone (probably their parents or siblings) who themselves are amateurs
and learned from amateurs.  Few people learn from professionals, and a
car is vastly more complex than a spam filter.  As such, bogofilter
should be easier to use than gas, brake, shift, steer, where a it can be
downloaded by an amateur and used intuitively with very little
instruction.  Few people care about optimal following distance or the
best fuel-air ratio in their injectors/carbeurator.  Most people don't
even keep their tires properly inflated.  Sure, this means that some
people have less effective cars than is strictly possible, but it works
well enough to get them to work in the morning.  You may dislike their
amateur driving skills, but the ease-of-use of the automobile is the
lubricant of our modern economy.  

Think about ESR's Aunt Tillie... could she drive bogofilter?  This is my
goal.  With bfproxy, I've made it possible for naive users to forward
errors to bogofilter with a small set of addresses which can be easily
set up as "contacts" without even caring what they syntactically mean. 
In this way, they can use train-on-error without understanding any of
the concepts behind it.  It's as user-friendly as training a dog with a
biscuit.

Now, I'd like to implement "tune-on-error" to supplement the training
such that it becomes easier to train, much as a puppy may become more
intelligent and learn new tricks quicker.  Again, the user does not need
to understand anything going on "under the hood", only that telling
bogofilter it was right or wrong should lead to improved filtering in
the future.  I believe this is an attainable goal, but would appreciate
a little input as to what tuning values should change how much in
response to what kind of correction.

Tom

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <https://www.bogofilter.org/pipermail/bogofilter/attachments/20040326/04a1e782/attachment.sig>