compile time options

Tue Sep 30 15:11:06 CEST 2003

On Tue, 30 Sep 2003 07:53:04 -0400
Greg Louis <glouis at dynamicro.on.ca> wrote:

> On 20030930 (Tue) at 1228:25 +0200, Joerg Over Dexia wrote:
> > Hi there,
> > 
> > Actually, I'm using Robinson-GM, and that for 2 reasons.
> > First, Robinson-Fisher gave me a result of 0.500000 in around 95% of
> > cases, for reasons unknown. I didn't delve into the reasons very much
> > but had the impression it had to do with my rather small database.
> > AFAIR I'm not the only one with that problem, there have been a
> > couple of postings about that phenomenon.
> > Second, like you mentioned, I like a (pseudo)linear spamicity
> > value.
> 
> It's really important to understand that you cannot obtain different
> spamicity results (in yes/no/unsure terms) by switching between GM and
> Fisher, except by using inappropriate parameter settings for one or
> both of them.  GM and Fisher are different ways of expressing the
> result of the same calculation.
> 
> The starting parameter values provided for GM may well be more
> successful with a small database than those provided for Fisher,
> purely because when we first published GM we all had smaller training
> dbs than when we published Fisher -- and in those days we didn't
> understand the underlying theory well enough to be anything but
> empirical.  It is, no doubt, well to get some experience with
> bogofilter before trying to tune it, but once the training db size
> becomes reasonable, tuning is worthwhile -- and much easier with
> Fisher than with GM.  So please entertain the possibility of switching
> some day.
> 
> David, we've discussed the somewhat haphazard derivation of our
> starting values in the past; maybe here's a criterion for choosing
> them rationally!  Shall I do a bunch of quick tunings on a number of
> small data sets and see if it's now possible to come up with Fisher
> params most newbies could live with?  (Trying to get generally usable
> values failed last time, but I think I have a much broader range of
> messages to work with now.)  Either this will help, or it will confirm
> more strongly that we need an easy quick tuning method that gives
> "best for now" values with tiny dbs.  Come to think of it, if we had
> such a thing, maybe it could even be made part of a classification run
> -- got a small db, retune it like a harpsichord for every use -- got a
> big one, treat it like a piano and tune it every 3 months :)

Greg,

You do wonderful things with algorithms, experiments, and figures.  Go
for it!!!

David