Suggested default parameter values, was Re: Hapax survival over time

Greg Louis glouis at dynamicro.on.ca
Fri Mar 26 17:07:34 CET 2004


On 20040326 (Fri) at 0932:06 -0500, Tom Anderson wrote:
> On Thu, 2004-03-25 at 09:37, Greg Louis wrote:
> > It's important to take into account that the distribution of message
> > scores can change drastically with alterations in bogofilter's
> > parameters.  If you use the new parameter values, you will find that
> > you need higher cutoffs than before (though your optima may indeed be
> > different from those I obtained in my test).
> 
> I started my database with the old defaults and have slowly made tweaks
> in both directions with all of my values.  My existing set is the best
> result of that process so far.  Are you saying that if I started with a
> different set of defaults, my configuration would converge to a
> completely different set of values?  If that is true, we need some
> chaoticians in here to explain it ;)

No, but the route would be very different.  What I meant to say is that
if you tried the suggested default s, x and min_dev, you would need a
high cutoff to get reasonable false-positive counts.  If you then start
tweaking, you may well end up back at your original best values.
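For readers following along, the way s (robs) and x (robx) interact in
Gary Robinson's token-score smoothing can be sketched roughly as
follows.  This is a simplified illustration that ignores bogofilter's
normalization of counts by message totals, and the function name is
mine:

```python
def token_score(spam_count, ham_count, s=1.0, x=0.5):
    """Robinson-style smoothed token score: f(w) = (s*x + n*p) / (s + n).

    s (robs) weights the prior x (robx) against the n observed
    occurrences of the token: with few sightings the score stays near
    x, and with many sightings it approaches the raw probability p.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: pure prior
    p = spam_count / n  # raw spamminess (simplified)
    return (s * x + n * p) / (s + n)
```

With a small s such as 0.01, even a token seen once or twice moves
nearly all the way to p, which is why the placement of x matters little
in that regime.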

> I'm quite certain that a spam cutoff of 0.99 will create far too many
> false negatives, and if this value were to be used as a default, it
> would have to be decreased very quickly for anyone to get the kind of
> results they would expect (>80% filtration).

Well, I assert that you are dead wrong about that.  I've used parameter
sets in the past that needed a cutoff of 0.998 or so and still gave
reasonably few false negatives (under 5%).

> > With the small s values it's moot.
> 
> It's not moot because the min_dev value plays a large role even with a
> small s.  Centering the range at a point between the cutoffs instead of
> 0.5 would mean that the area where you expect a token to be undecidable
> coincides with the range of scores in which you'd expect an email to be
> undecidable.  Granted these two concepts don't wholly overlap, but that
> may be part of the problem.  If I receive an email containing a single
> word, then it would be expected that the email score would equal the
> token score.

What is moot with the small s values is the question of where to put x,
not where to put min_dev.  Sorry; I missed your main point about
re-centering min_dev.  Given the enormous change in message-score
distribution that a smallish change in s, x and/or min_dev can produce,
I can't foresee what the effect of such a change might be.
Experimentation is required.
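To make the re-centering idea concrete, here is a rough sketch of the
min_dev selection step: tokens whose scores fall within min_dev of a
center are left out of the message score.  The center parameter
(normally 0.5) is my addition to illustrate the re-centering being
discussed, not an existing bogofilter option:

```python
def significant_tokens(scores, min_dev=0.1, center=0.5):
    """Keep only tokens far enough from the center to be decisive.

    A token with |score - center| < min_dev is treated as carrying no
    useful evidence and is excluded from the message-score computation.
    """
    return [f for f in scores if abs(f - center) >= min_dev]
```

With min_dev near 0.45, only tokens scoring close to 0 or 1 survive,
which is the extreme-tokens-only regime some users report doing best
with.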

> As it is now, however, I've seen claims of tiny min_devs (~0.01)
> centered around 0.5, but cutoffs nowhere near 0.5 (~0.1-0.2).  This
> makes little sense.

I haven't seen any such.  The lowest spam cutoff I've ever seen used
was somewhere around 0.45, and I myself have never used one below 0.49
(I powerfully mistrust spam cutoffs under 0.55, though I'm currently
doing well with 0.5322 in production).

> In such a situation, 0.5 has no meaning whatsoever,
> which is why a very small min_dev is required (to make 0.5 meaningless
> as it actually is).  This may also be why your test showed such a tiny
> robs value... because your other values made robx meaningless.

In my testing I set an s, a min_dev and an x and determine the required
spam cutoff, not the other way round.  I test the range of min_dev from
0.02 to 0.45 (and even up to 0.465), the range of s from 1 to 0.01 and
the range of x from the calculated starting point to 0.1 either side of
that.  The values I report give the best accuracy with the messages
tested.  I want to understand why, but I value what works, whether or
not I understand why at present.
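The sweep just described amounts to a brute-force grid search over the
three parameters.  A minimal sketch, with evaluate standing in
(hypothetically) for a run over the test messages that returns an
accuracy figure:

```python
import itertools

def grid_search(evaluate, min_devs, s_values, x_values):
    """Try every (min_dev, s, x) combination and keep the best.

    evaluate(min_dev, s, x) is assumed to score the parameter set
    against the test corpora (higher is better), e.g. by running the
    filter and counting correct classifications.
    """
    best_acc, best_params = float("-inf"), None
    for md, s, x in itertools.product(min_devs, s_values, x_values):
        acc = evaluate(md, s, x)
        if acc > best_acc:
            best_acc, best_params = acc, (md, s, x)
    return best_params, best_acc
```

The point of reporting only the winning combination is exactly the
empirical stance taken here: the values that score best on the test
messages are the ones reported, explained or not.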

> Well I'm of the opinion that robx and min_dev are very important
> concepts that we want to take advantage of.  And I believe I am doing so
> with my current values.  This means a significant robs and min_dev value
> are required and useful.  If you do not use them correctly though, you
> will of course get better results by minimizing their effect, which is
> precisely what you and some others on this list appear to be doing. 
> Perhaps some assumptions are made in bogotune which lead to this?

Why not quit believing and start proving?

We who use small s and/or min_dev use them as we do because
grid-testing the whole range has shown that it works better for us. 
I've tried s values as high as 10 and never seen an advantage to going
much above 0.1.

> It may be that your entrenched theory is entirely wrong and that you
> are trying to bend values to fit your theory.

No, I am refusing to let the theory distort what is intrinsically an
empirical, brute-force search for values that work.  Failure to explain
the results of such a brute-force search invalidates the theory, not
the other way around.

> I've noticed that you've maintained that train-on-error and
> training-to-exhaustion are entirely useless

You surprise me.  I have been training on error ever since my message
db got to 10,000 of each.

> (or at least that you personally have found no positive effect)
> whereas I and others on this list have found it very useful.

I have tried training repetitively a couple of times, and training
purely on error a couple of times, and found either no effect or a
negative effect.  Others have had different results.  Similarly, I know
users for whom 0.45 or 0.465 work best as min_dev values; I do best at
0.02.  I have no intention of using the proposed default parameters in
my own production environment, because I already know values that work
better for that environment; you're in the same situation.  That's not
relevant; what we're trying to accomplish at the moment is to find a
set of parameter values that will do reasonably well -- as a starting
point -- in a wide variety of environments, with a wide variety of
corpora. I am not interested in forcing any individual to use them;
only in setting bogofilter's distribution defaults to something that
works decently, though perhaps not optimally, for the widest possible
range of new users.  I haven't any interest at all (for now) in why,
only in whether, such a parameter set exists or doesn't.  If anyone
wants to influence the values, they need to produce experimental
results, not theory; theory has a way to go yet before it explains
already observed fact, and until it does that, I won't trust it for
predicting anything.

> In this same vein, I have found that increasing robs has been useful. 
> Now, I don't want to assume that I'm doing something right and you're
> doing something wrong, as I maintain that I am likely the more
> amateur in regard to the statistical math going on here, but I'm
> starting to wonder why your results differ so profoundly.  It seems
> to me that if you have not found significant values of robs to be
> practical, then you are misusing robx such that minimizing the effect
> of this misuse is useful.

That's just nonsense.  You have the code, go see for yourself.  I've
tried large s and for me it simply does not work (except when I was
trying train-on-error with no prior full training, and the database was
tiny) as well as smaller values do.  If I cared why your results differ
so greatly from mine, my starting point for investigation would have to
do with the big difference in training method between you and me.

---

> Think about ESR's Aunt Tillie... could she drive bogofilter?  This is my
> goal.

I have no confidence in your ability to reach that goal, but I wouldn't
object to being proven wrong :)

> a little input as to what tuning values should change how much in
> response to what kind of correction.

The best I can offer you is to suggest a way of investigating that:
make big batched corrections and see how the optimal parameters change,
and try to derive rules from that.  Predicting the effect of a single
registration or unregistration on the basis of our present
understanding is really not something I'd be prepared to attempt.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |
