compile time options

Tue Sep 30 14:36:45 CEST 2003

Hello,

At 07:53 30.09.2003 -0400, Greg Louis wrote the following to me:
->It's really important to understand that you cannot obtain different
->spamicity results (in yes/no/unsure terms) by switching between GM and
->Fisher, except by using inappropriate parameter settings for one or
->both of them.  GM and Fisher are different ways of expressing the
->result of the same calculation.

Well, I do believe I understand that. However, I am able to
obtain numerically different spamicity results, and that's what
I'm interested in for several reasons. One is the immediately
visible effect when registering spam and nonspam. Another, along
the same line of thought, is that I can see when registering
doesn't seem to have an effect, as with the recent bug involving
the -u - switch and different databases for spam and ham.
I know I can probably diagnose that with a couple of switches,
too, but this way I have it quite comfortably, right there, on my
incoming mail. (There are other reasons, too; good reasons, I
believe, but that might lead too far.)

The more distinctive classification Fisher provides might be
advantageous in some circumstances, but, honestly, I fail to see
which.
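
To make the GM versus Fisher comparison concrete, here is a minimal sketch of the two combining formulas as Gary Robinson published them. It is not taken from bogofilter's source: the function names are made up for illustration, the per-token probabilities are invented, and each is assumed to lie strictly between 0 and 1.

```python
import math

def gm_score(probs):
    """Robinson's geometric-mean combination of per-token spam probabilities."""
    n = len(probs)
    # Geometric means computed through log sums to avoid underflow.
    p = 1.0 - math.exp(sum(math.log(1.0 - x) for x in probs) / n)
    q = 1.0 - math.exp(sum(math.log(x) for x in probs) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0

def chi2q(x, dof):
    """Upper tail of the chi-square distribution; dof must be even."""
    m = x / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_score(probs):
    """Robinson-Fisher combination of the same per-token probabilities."""
    n = len(probs)
    a = chi2q(-2.0 * sum(math.log(x) for x in probs), 2 * n)
    b = chi2q(-2.0 * sum(math.log(1.0 - x) for x in probs), 2 * n)
    return (1.0 + a - b) / 2.0

if __name__ == "__main__":
    # Hypothetical token probabilities, not real bogofilter output.
    for probs in ([0.9] * 10, [0.1] * 10, [0.2, 0.8]):
        print(probs, gm_score(probs), fisher_score(probs))
```

Both functions take the same token probabilities and land on the same side of 0.5, but Fisher pushes clearly spammy or hammy input much closer to 1 or 0, while GM stays nearer the middle. Whether a given message ends up as yes, no, or unsure still depends on the cutoffs chosen for each method.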

->It is, no doubt, well to get some experience with bogofilter before
->trying to tune it, but once the training db size becomes reasonable,
->tuning is worthwhile -- and much easier with Fisher than with GM.  So
->please entertain the possibility of switching some day.

The reasoning is quite convincing, but I believe a lot of people
who could put bogofilter to good use aren't in a position to
acquire a reasonably sized training db; count me among them. That
has to do not only with mail traffic volume; it also touches on
privacy issues, especially for the ham corpus, of course.

->David, we've discussed the somewhat haphazard derivation of our
->starting values in the past; maybe here's a criterion for choosing them
->rationally!  Shall I do a bunch of quick tunings on a number of small
->data sets and see if it's now possible to come up with Fisher params
->most newbies could live with?  (Trying to get generally usable values
->failed last time, but I think I have a much broader range of messages
->to work with now.)  Either this will help, or it will confirm more
->strongly that we need an easy quick tuning method that gives "best for
->now" values with tiny dbs.  Come to think of it, if we had such a
->thing, maybe it could even be made part of a classification run -- got
->a small db, retune it like a harpsichord for every use -- got a big
->one, treat it like a piano and tune it every 3 months :)

I'd definitely appreciate both of these, the default values and
the quick tuning method. However, I'd probably not switch to
Fisher. (Actually, I'd probably rather switch to a different
Bayes filter than switch to Fisher.) As you say, it doesn't
change spamicity results (in yes/no/unsure terms), and the
advantage in tuning doesn't outweigh the loss of accessibility of
the calculation result in my small db world :)

Greetings, jo