Suggested default parameter values, was Re: Hapax survival over time

Wed Mar 24 19:29:14 CET 2004

On 20040324 (Wed) at 0909:09 -0500, Tom Anderson wrote:
> 
> Following the above argument, when bogofilter first "looks up" a new
> word, it should not have high confidence in how that word will be used
> in future instances.  Therefore, it should not move too far from robx. 
> I would want at least several registrations before it moves out of my
> min_dev range and starts affecting classifications.

That, of course, being exactly what robs is intended to control. 
Curiously -- I still don't understand why but I've repeated it over and
over -- most training databases with reasonable numbers of messages
registered (>10,000 say) will produce much more accurate results with
values of s in the range of 0.01 to 0.03 than they do with s up around
1, which is where you'd expect it to be useful.  With s that small,
it's really only affecting unknowns, not hapaxes, so I guess that tells
us that hapaxes can be useful (particularly if they're only hapaxes
because you're training on error).
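
To make the effect of s concrete, here is a minimal sketch of Gary
Robinson's smoothing formula, f(w) = (s*x + n*p(w)) / (s + n), which is
what robs (s) and robx (x) feed into.  The function name and the sample
counts are mine, purely for illustration, and p here is just the raw
spam-probability estimate for the token; this is not bogofilter's code.

```python
def robinson_f(p, n, s, x):
    """Smoothed token score: f(w) = (s*x + n*p) / (s + n).

    p: raw spam-probability estimate for the token (illustrative input)
    n: number of times the token has been registered
    s: robs, the weight given to the prior
    x: robx, the score assumed for a token never seen before
    """
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    robx = 0.52
    for robs in (1.0, 0.1, 0.0178):
        unknown = robinson_f(0.0, 0, robs, robx)  # n = 0: p is irrelevant
        hapax = robinson_f(1.0, 1, robs, robx)    # seen once, in spam only
        print(f"robs={robs:<6} unknown={unknown:.4f} hapax={hapax:.4f}")
```

With s = 1 a spam-only hapax lands at 0.76, halfway between robx and
its raw estimate; with s around 0.02 it sits above 0.99, so in practice
only never-seen tokens (n = 0) stay at robx.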

Mind you, when Gary first proposed the s and x parameters, there was no
such thing as a min_dev.  That was a separate concept; we tried
dropping Paul's extrema and looking at every score, and it didn't do
badly; then I introduced min_dev as a less extreme (pun intended) way
of eliminating noise.  At first it seemed that 0.1 worked better than
anything lower or higher, which is how that became the distribution
default; later, many people found that much bigger min_dev values did
well for them, like extrema used to; others went the other way and got
better results with 0.02 or so.
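
As a rough illustration of what min_dev does, the sketch below drops
every token whose score lies within min_dev of 0.5 before the scores
are combined.  The token scores are invented, the helper name is
hypothetical, and whether the boundary comparison is strict or
inclusive is an assumption on my part.

```python
def significant_tokens(scores, min_dev):
    """Keep tokens whose score is at least min_dev away from 0.5.

    scores: dict mapping token -> f(w); the values used below are made up.
    The inclusive comparison (>=) is an assumption for this sketch.
    """
    return {tok: f for tok, f in scores.items() if abs(f - 0.5) >= min_dev}


if __name__ == "__main__":
    sample = {"viagra": 0.99, "meeting": 0.05, "offer": 0.76, "the": 0.51}
    for min_dev in (0.02, 0.1, 0.375):
        kept = sorted(significant_tokens(sample, min_dev))
        print(f"min_dev={min_dev}: {kept}")
```

A large min_dev such as 0.375 keeps only scores above 0.875 or below
0.125, which behaves much like the old extrema approach; a small one
such as 0.02 lets nearly everything through.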

Which segues nicely into a related topic: with the aid of Andrew Partan
and David, I've assembled four large corpora from four different
sources and used them to determine a set of parameters that do
reasonably well for all four.  Before doing that, I ran bogotune on
each individually, using just the wordlist supplied with each, to get
the optimal parameters for each corpus separately; they differed quite
widely.  Then I built a merged wordlist and used that to get global
parameters for the merged corpora.  Once these new parameters had been
determined, the individual corpora were re-evaluated so that the
accuracy with the general parameter set could be compared with that
obtained with the individual optima.  In the following table, the fp
and fn columns present percentages of false positives and false
negatives achieved with the new parameters; the ofp and ofn columns
present the corresponding percentages obtained with the individual
bogotuned parameters:

 source    fp   fn   ofp  ofn    spam nonspam
     ap 0.008 0.39  0.01 0.47  76,324 132,849
     dr 0.030 4.87  0.05 2.74  56,172  60,462
    csl 0.180 5.45  0.20 3.59  60,120  62,312
     gl 0.013 1.19  0.01 0.99  52,127  51,951

Interestingly, the common parameter set actually proved _better_ than
what bogotune came up with for the ap messages alone.  For the dr and
csl corpora, the difference was minor (and could likely be almost
eliminated by adjusting the cutoff slightly).  Only my own mail's
results were notably improved when parameters determined with that
corpus were applied.

The suggested parameter values are:

robx=0.52
min_dev=0.375
robs=0.0178
spam_cutoff=0.99
ham_cutoff=0.45 (or 0 if one prefers binary evaluation)

and I am suggesting we make those the new defaults in the bogofilter
distribution.  People might like to try them (adjusting the spam cutoff
to give an appropriate false-positive / false-negative ratio) and see
if they really are widely applicable.  (If reporting results, please
give all of the figures as above: values with these parameters, values
with your normal favourites, number of spam messages tested and number
of nonspam messages tested.  Also desirable are the message counts and
number of tokens in your training db at the time of testing.)
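
For anyone scripting such a comparison, here is a small sketch of how
the two cutoffs turn a message score into a verdict, and how the fp and
fn percentages in the table are formed.  The function names are
hypothetical, and treating ham_cutoff=0 as "no unsure band" is my
reading of the binary-evaluation remark above, not a quotation of
bogofilter's code.

```python
def classify(score, spam_cutoff=0.99, ham_cutoff=0.45):
    """Return 'spam', 'ham' or 'unsure' for a message score in [0, 1].

    Assumption: ham_cutoff == 0 means binary evaluation, i.e. anything
    below spam_cutoff counts as ham and nothing is reported as unsure.
    """
    if score >= spam_cutoff:
        return "spam"
    if ham_cutoff == 0 or score <= ham_cutoff:
        return "ham"
    return "unsure"


def error_percentages(fp, fn, nonspam, spam):
    """False positives as % of nonspam tested, false negatives as % of spam."""
    return 100.0 * fp / nonspam, 100.0 * fn / spam


if __name__ == "__main__":
    for score in (0.995, 0.70, 0.20):
        print(score, classify(score), classify(score, ham_cutoff=0))
    # Made-up counts, only to show the arithmetic.
    print(error_percentages(fp=1, fn=2, nonspam=1000, spam=400))
```

Raising spam_cutoff trades false positives for false negatives, which
is the adjustment suggested above when trying these values on your own
mail.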

This experiment will be written up in detail on my bogofilter web site
in the next few days, and I'll announce it on the list when that's
done.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |