Suggested default parameter values, was Re: Hapax survival over time

Sat Mar 27 07:28:34 CET 2004

On Fri, 2004-03-26 at 11:07, Greg Louis wrote:
> > Well I'm of the opinion that robx and min_dev are very important
> > concepts that we want to take advantage of.  And I believe I am doing so
> > with my current values.  This means significant robs and min_dev values
> > are required and useful.  If you do not use them correctly though, you
> > will of course get better results by minimizing their effect, which is
> > precisely what you and some others on this list appear to be doing. 
> > Perhaps some assumptions are made in bogotune which lead to this?
> 
> Why not quit believing and start proving?

My own values are well-proven on my database and seem to coincide
roughly with other published numbers on this list.  Conjecture about why
anyone else would do well by minimizing the effect of theoretically and
practically useful values is simply a thought experiment as I don't keep
large volumes of archived mail, nor do I have the time or interest to
run such an experiment if I did.  My bogofilter runs as I would
theoretically expect it to, with robx, robs, and min_dev apparently
performing their intended function.

> We who use small s and/or min_dev use them as we do because
> grid-testing the whole range has shown that it works better for us. 

I'm not in a position to audit your testing process.  I'm just
floating the idea that there may be a reason why your testing shows what
it does, and that reason may not be that robs and min_dev are useless. 
You yourself voiced your confusion ("Gary and I have scratched our
heads") over the apparent disconnect between your results and the
theoretical predictions.

> > It may be that your entrenched theory is entirely wrong and that you
> > are trying to bend values to fit your theory.
> 
> No, I am refusing to let the theory distort what is intrinsically an
> empirical, brute-force search for values that work.  Failure to explain
> the results of such a brute-force search invalidates the theory, not
> the other way around.

Sometimes there's more than one explanation for a given event.  Let's
not throw away the theory based on the results of an experiment unless
we know for certain that the experiment was actually testing the theory,
and that no other variables were involved.  What you refer to
as intrinsically empirical is of course tainted by your own
expectations.  That is why double-blind experiments are necessary.  I
myself barely have time to offer mild brainstorming as a response to
your experiment, but I hope others will repeat it.

> You surprise me.  I have been training on error ever since my message
> db got to 10,000 of each.

> I have tried training repetitively a couple of times, and training
> purely on error a couple of times, and found either no effect or a
> negative effect.  Others have had different results.  Similarly, I know

Why do you act surprised in the first instance, but then immediately
confirm my observation in your following statement?  Others have had
different results from you... that's precisely what I said.  I had no
other point to make but that.  The difference tends to stem from certain
assumptions made about the training methods.  Pi has pointed this out
more than once on this list.

> relevant; what we're trying to accomplish at the moment is to find a
> set of parameter values that will do reasonably well -- as a starting
> point -- in a wide variety of environments, with a wide variety of

If that is your goal, I don't think you'll find it with strictly
empirical tests on several test sets, as you'll invariably find
variation from set to set, and you could test forever without converging
on the "best" starting numbers.

I think most would agree that users should start with "conservative"
values at first, and then become more aggressive according to how their
mail is being classified.  I doubt there is any more optimal default for
the cutoffs than spam_cutoff=1, ham_cutoff=0.  I do not need empirical
tests to know that this will be at least as good as having no filter at
all, and a fine place to start from.  Precisely where the other variables
start is not terribly important, except that robx should lie within
0.5 +/- min_dev, and robs and min_dev should be large enough that tokens
registered only once don't greatly affect classifications.  If I've
registered 5 ham and 5 spam, bogofilter should not be too confident yet
in making predictions based on these.
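
To make the role of robs and min_dev concrete, here is a rough sketch of
Robinson-style smoothing as I understand it: the raw spam estimate p for a
token seen n times is pulled toward robx with weight robs, and tokens whose
smoothed score lands within min_dev of 0.5 are ignored.  The function names
and the parameter values below are illustrative assumptions, not
bogofilter's actual code or defaults.

```python
def robinson_f(p, n, robx=0.5, robs=1.0):
    """Smoothed token score (sketch): blend the raw estimate p, seen n times,
    with the prior robx, where robs is the weight given to the prior."""
    return (robs * robx + n * p) / (robs + n)


def token_counts(f, min_dev=0.1):
    """True if the smoothed score is far enough from 0.5 to be used."""
    return abs(f - 0.5) >= min_dev


if __name__ == "__main__":
    # A hapax: a token seen exactly once, and only in spam (p = 1.0, n = 1).
    # The robs and min_dev values here are illustrative only.
    for robs in (0.01, 1.0):
        f = robinson_f(p=1.0, n=1, robs=robs)
        print(robs, round(f, 3), token_counts(f, min_dev=0.3))
```

With robs=1.0 the hapax scores 0.75 and falls inside a min_dev of 0.3, so it
is ignored; with robs=0.01 it scores close to 1 and is counted at nearly full
strength.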

That basic description would be my "ideal" starting condition for all of
my users.  I would then try to use "tune-on-error" to produce values
which perform better.

> corpora. I am not interested in forcing any individual to use them;
> only in setting bogofilter's distribution defaults to something that
> works decently, though perhaps not optimally, for the widest possible

I'm not confident that starting out with a "smart" statistical filter is
a good idea.  I'd rather start it dumb and let it learn.  As soon as you
try to anticipate "decent" defaults, you'll run into cases where they
are not decent for some users, and false positives will result.  A dumb
filter will not assume anything about your email until you tell it to.
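
As a small illustration of why spam_cutoff=1, ham_cutoff=0 amounts to a
"dumb" start, here is a sketch of a three-way classification by cutoffs.
The function name and the exact comparison operators are assumptions made
for illustration; check the bogofilter source for the real rules.

```python
def classify(score, spam_cutoff=1.0, ham_cutoff=0.0):
    """Three-way classification of a message score in [0, 1] (sketch)."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


if __name__ == "__main__":
    # With the conservative cutoffs, even a very spammy score stays "unsure".
    print(classify(0.99))
    # Tightened cutoffs (hypothetical values) start making decisions.
    print(classify(0.99, spam_cutoff=0.95, ham_cutoff=0.10))
```

Under these assumptions, only a score of exactly 1 or exactly 0 is ever
classified at the starting cutoffs, so the filter makes essentially no
decisions until the user moves the cutoffs inward.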

> > starting to wonder why your results differ so profoundly.  It seems
> > to me that if you have not found significant values of robs to be
> > practical, then you are misusing robx such that minimizing the effect
> > of this misuse is useful.
> 
> That's just nonsense.  You have the code, go see for yourself.  I've
> tried large s and for me it simply does not work (except when I was
> trying train-on-error with no prior full training, and the database was
> tiny) as well as smaller values do.  If I cared why your results differ
> so greatly from mine, my starting point for investigation would have to
> do with the big difference in training method between you and me.

Well that could be your answer.  It's perfectly sensible.  Large s does
work for you when you train-on-error with no full training.  The full
training, it would seem, is causing the disconnect between theoretical
predictions and your experimental results.  That narrows down the
problem substantially.

> > Think about ESR's Aunt Tillie... could she drive bogofilter?  This is my
> > goal.
> 
> I have no confidence in your ability to reach that goal, but I wouldn't
> object to being proven wrong :)

I'll strive to do so ;)

Tom