compile time options

Tue Sep 30 13:53:04 CEST 2003

On 20030930 (Tue) at 1228:25 +0200, Joerg Over Dexia wrote:
> Hi there,
> 
> Actually, I'm using Robinson-GM, and that for 2 reasons.
> First, Robinson-Fisher gave me around 95% 0.500000 - results for
> reasons unknown. I didn't delve into the reasons very much but
> had the impression it had to do with my rather small database.
> AFAIR I'm not the only one with that problem, there have been a
> couple of postings about that phenomenon.
> Second, like you mentioned, I like a (pseudo)linear spamicity
> value.

It's really important to understand that you cannot obtain different
spamicity results (in yes/no/unsure terms) by switching between GM and
Fisher, except by using inappropriate parameter settings for one or
both of them.  GM and Fisher are different ways of expressing the
result of the same calculation.
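
To make that concrete, here is a minimal Python sketch.  It is not
bogofilter's code: the function names are made up for illustration, the
formulas are the ones I am assuming from Gary Robinson's published
descriptions of the two methods, and bogofilter's tunable parameters
are ignored.  Every token probability is assumed to lie strictly
between 0 and 1.

```python
import math

def chi2_sf_even(x, dof):
    """Chi-square survival function; valid for even dof only."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def log_products(probs):
    """Return (ln prod p, ln prod (1 - p)): the shared raw calculation."""
    ln_p = sum(math.log(p) for p in probs)
    ln_q = sum(math.log(1.0 - p) for p in probs)
    return ln_p, ln_q

def gm_score(probs):
    """Geometric-mean form (assumed): S = (1 + (P - Q) / (P + Q)) / 2."""
    n = len(probs)
    ln_p, ln_q = log_products(probs)
    P = 1.0 - math.exp(ln_q / n)   # large when the tokens look spammy
    Q = 1.0 - math.exp(ln_p / n)   # large when the tokens look hammy
    return (1.0 + (P - Q) / (P + Q)) / 2.0

def fisher_score(probs):
    """Fisher form (assumed): same two products, fed to chi-square."""
    n = len(probs)
    ln_p, ln_q = log_products(probs)
    H = chi2_sf_even(-2.0 * ln_p, 2 * n)
    S = chi2_sf_even(-2.0 * ln_q, 2 * n)
    return (1.0 + H - S) / 2.0
```

In this sketch both scores depend only on the same two products, and
both land on the same side of 0.5 for any input.  Fisher pushes its
values much closer to 0 and 1, so cutoffs chosen for one method are
not appropriate for the other.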

The starting parameter values provided for GM may well be more
successful with a small database than those provided for Fisher, purely
because when we first published GM we all had smaller training dbs than
when we published Fisher -- and in those days we didn't understand the
underlying theory well enough to be anything but empirical.  It is no
doubt wise to get some experience with bogofilter before trying to
tune it, but once the training db size becomes reasonable, tuning is
worthwhile -- and much easier with Fisher than with GM.  So please
entertain the possibility of switching some day.

David, we've discussed the somewhat haphazard derivation of our
starting values in the past; maybe here's a criterion for choosing them
rationally!  Shall I do a bunch of quick tunings on a number of small
data sets and see if it's now possible to come up with Fisher params
most newbies could live with?  (Trying to get generally usable values
failed last time, but I think I have a much broader range of messages
to work with now.)  Either this will help, or it will confirm more
strongly that we need an easy quick tuning method that gives "best for
now" values with tiny dbs.  Come to think of it, if we had such a
thing, maybe it could even be made part of a classification run -- got
a small db, retune it like a harpsichord for every use -- got a big
one, treat it like a piano and tune it every 3 months :)

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |