Suggested default parameter values, was Re: Hapax survival over time
glouis at dynamicro.on.ca
Fri Mar 26 11:07:34 EST 2004
On 20040326 (Fri) at 0932:06 -0500, Tom Anderson wrote:
> On Thu, 2004-03-25 at 09:37, Greg Louis wrote:
> > It's important to take into account that the distribution of message
> > scores can change drastically with alterations in bogofilter's
> > parameters. If you use the new parameter values, you will find that
> > you need higher cutoffs than before (though your optima may indeed be
> > different from those I obtained in my test).
> I started my database with the old defaults and have slowly made tweaks
> in both directions with all of my values. My existing set is the best
> result of that process so far. Are you saying that if I started with a
> different set of defaults, my configuration would converge to a
> completely different set of values? If that is true, we need some
> chaoticians in here to explain it ;)
No, but the route would be very different. What I meant to say is that
if you tried the suggested default s, x and min_dev, you would need a
high cutoff to get reasonable false-positive counts. If you then start
tweaking, you may well end up back at your original best values.
> I'm quite certain that a spam cutoff of 0.99 will create far too many
> false negatives, and if this value were to be used as a default, it
> would have to be decreased very quickly for anyone to get the kind of
> results they would expect (>80% filtration).
Well, I assert that you are dead wrong about that. I've used parameter
sets in the past that needed 0.998 or so and still gave reasonably few
fn (under 5%).
> > With the small s values it's moot.
> It's not moot because the min_dev value plays a large role even with a
> small s. Centering the range at a point between the cutoffs instead of
> 0.5 would mean that the area where you expect a token to be undecidable
> coincides with the range of scores in which you'd expect an email to be
> undecidable. Granted these two concepts don't wholly overlap, but that
> may be part of the problem. If I receive an email containing a single
> word, then it would be expected that the email score would equal the
> token score.
What is moot with the small s values is the question of where to put x,
not where to put min_dev. Sorry; I missed your major point about
re-centering min_dev. Given the enormous change in message-score
distribution that a smallish change in s, x and/or min_dev can produce,
I can't foresee what the effect of such a change might be.
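To make concrete what s, x and min_dev actually do to a token, here is a
rough Python sketch of the Robinson-style smoothing as I understand it.
The function names and the exact count conventions are mine, not
bogofilter's, so treat this as an illustration rather than the real
implementation:

```python
def smoothed_token_score(good, bad, good_total, bad_total, s=0.01, x=0.52):
    """Robinson-style smoothed spam probability for one token.

    s (robs) is the strength of the prior; x (robx) is the score
    assigned to a never-seen token.  good/bad are this token's
    occurrence counts; good_total/bad_total are message counts.
    """
    good_freq = good / good_total
    bad_freq = bad / bad_total
    # Raw spamminess p(w) from relative frequencies; fall back to x
    # when the token has never been seen at all.
    p = bad_freq / (bad_freq + good_freq) if (bad_freq + good_freq) else x
    n = good + bad                      # how much evidence we have
    return (s * x + n * p) / (s + n)    # prior dominates when n is small

def is_discriminating(f, min_dev=0.1, center=0.5):
    # Tokens scoring inside [center - min_dev, center + min_dev] are
    # treated as neutral and excluded from the message score.  Tom's
    # re-centering idea amounts to moving `center` away from 0.5.
    return abs(f - center) > min_dev
```

With a small s, an unseen token scores exactly x but even a handful of
sightings swamps the prior, which is why the placement of x becomes moot
while the placement of min_dev does not.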
> As it is now, however, I've seen claims of tiny min_devs (~0.01)
> centered around 0.5, but cutoffs nowhere near 0.5 (~0.1-0.2). This
> makes little sense.
I haven't seen any such. The lowest spam cutoff I've ever seen used
was somewhere around 0.45, and I myself have never used one below 0.49
(I powerfully mistrust spam cutoffs under 0.55, though I'm currently
doing well with 0.5322 in production.)
> In such a situation, 0.5 has no meaning whatsoever,
> which is why a very small min_dev is required (to make 0.5 meaningless
> as it actually is). This may also be why your test showed such a tiny
> robs value... because your other values made robx meaningless.
In my testing I set an s, a min_dev and an x and determine the required
spam cutoff, not the other way round. I test the range of min_dev from
0.02 to 0.45 (and even up to 0.465), the range of s from 1 to 0.01 and
the range of x from the calculated starting point to 0.1 either side of
that. The values I report give the best accuracy with the messages
tested. I want to understand why, but I value what works, whether or
not I understand why at present.
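The sweep I'm describing can be sketched roughly as follows. This is a
simplified illustration, not bogotune: score_messages is a stand-in for
however you score a corpus at given parameters, and the grids shown are
just example points from the ranges above:

```python
import itertools

def grid_search(score_messages, ham, spam):
    """Brute-force sweep over (s, min_dev, x).

    score_messages(msgs, s, min_dev, x) is assumed to return a list of
    message scores in [0, 1].  For each parameter set we determine the
    required spam cutoff rather than fixing the cutoff in advance.
    """
    s_values = [1.0, 0.1, 0.05, 0.01]                 # s from 1 down to 0.01
    min_dev_values = [0.02, 0.1, 0.2, 0.3, 0.45, 0.465]
    x_values = [0.42, 0.47, 0.52, 0.57, 0.62]         # calculated x +/- 0.1
    best = None
    for s, md, x in itertools.product(s_values, min_dev_values, x_values):
        ham_scores = score_messages(ham, s, md, x)
        spam_scores = score_messages(spam, s, md, x)
        # Lowest cutoff giving zero false positives on this ham sample,
        # then count false negatives at that cutoff.
        cutoff = max(ham_scores) + 1e-6
        fn = sum(1 for sc in spam_scores if sc < cutoff)
        if best is None or fn < best[0]:
            best = (fn, s, md, x, cutoff)
    return best
```

The point of doing it this way round is that the cutoff is an output of
the test, not an input: each (s, min_dev, x) combination produces its
own score distribution and therefore its own required cutoff.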
> Well I'm of the opinion that robx and min_dev are very important
> concepts that we want to take advantage of. And I believe I am doing so
> with my current values. This means a significant robs and min_dev value
> are required and useful. If you do not use them correctly though, you
> will of course get better results by minimizing their effect, which is
> precisely what you and some others on this list appear to be doing.
> Perhaps some assumptions are made in bogotune which lead to this?
Why not quit believing and start proving?
We who use small s and/or min_dev use them as we do because
grid-testing the whole range has shown that it works better for us.
I've tried s values as high as 10 and never seen an advantage to going
much above 0.1.
> It may be that your entrenched theory is entirely wrong and that you
> are trying to bend values to fit your theory.
No, I am refusing to let the theory distort what is intrinsically an
empirical, brute-force search for values that work. Failure to explain
the results of such a brute-force search invalidates the theory, not
the other way around.
> I've noticed that you've maintained that train-on-error and
> training-to-exhaustion is entirely useless
You surprise me. I have been training on error ever since my message
db got to 10,000 of each.
> (or at least that you personally have found no positive effect)
> whereas I and others on this list have found it very useful.
I have tried training repetitively a couple of times, and training
purely on error a couple of times, and found either no effect or a
negative effect. Others have had different results. Similarly, I know
users for whom 0.45 or 0.465 work best as min_dev values; I do best at
0.02. I have no intention of using the proposed default parameters in
my own production environment, because I already know values that work
better for that environment; you're in the same situation. That's not
relevant; what we're trying to accomplish at the moment is to find a
set of parameter values that will do reasonably well -- as a starting
point -- in a wide variety of environments, with a wide variety of
corpora. I am not interested in forcing any individual to use them;
only in setting bogofilter's distribution defaults to something that
works decently, though perhaps not optimally, for the widest possible
range of new users. I haven't any interest at all (for now) in why,
only in whether, such a parameter set exists or doesn't. If anyone
wants to influence the values, they need to produce experimental
results, not theory; theory has a way to go yet before it explains
already observed fact, and until it does, I won't trust it.
> In this same vein, I have found that increasing robs has been useful.
> Now, I don't want to assume that I'm doing something right and you're
> doing something wrong, as I maintain that I am likely the more
> amateur in regard to the statistical math going on here, but I'm
> starting to wonder why your results differ so profoundly. It seems
> to me that if you have not found significant values of robs to be
> practical, then you are misusing robx such that minimizing the effect
> of this misuse is useful.
That's just nonsense. You have the code, go see for yourself. I've
tried large s and for me it simply does not work (except when I was
trying train-on-error with no prior full training, and the database was
tiny) as well as smaller values do. If I cared why your results differ
so greatly from mine, my starting point for investigation would have to
do with the big difference in training method between you and me.
> Think about ESR's Aunt Tillie... could she drive bogofilter? This is my
I have no confidence in your ability to reach that goal, but I wouldn't
object to being proven wrong :)
> a little input as to what tuning values should change how much in
> response to what kind of correction.
The best I can offer you is to suggest a way of investigating that:
make big batched corrections and see how the optimal parameters change,
and try to derive rules from that. Predicting the effect of a single
registration or unregistration on the basis of our present
understanding is really not something I'd be prepared to attempt.
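In outline, the experiment I'm suggesting looks like this. Everything
here is hypothetical scaffolding: `train` and `tune` are stand-ins for
batch registration and a bogotune-style parameter sweep, and the batch
size is arbitrary:

```python
def track_parameter_drift(train, tune, corrections, batch_size=500):
    """Apply corrections in large batches, re-tuning after each batch.

    train(batch)  -- register/unregister a batch of corrected messages
    tune()        -- re-run the parameter sweep, returning the optimum
    Returns the sequence of optima, from which one might try to derive
    rules about how each kind of correction moves the parameters.
    """
    history = []
    for i in range(0, len(corrections), batch_size):
        train(corrections[i:i + batch_size])
        history.append(tune())
    return history
```

If the optima drift in a consistent direction as corrections accumulate,
that drift is the raw material for the per-correction rules Tom is
asking about; if they jump around, per-correction prediction is probably
hopeless anyway.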
| G r e g L o u i s | gpg public key: 0x400B1AA86D9E3E64 |
| http://www.bgl.nu/~glouis | (on my website or any keyserver) |
| http://wecanstopspam.org in signatures helps fight junk email. |