Better default parameters: I need your help please!

Greg Louis glouis at dynamicro.on.ca
Sat Nov 22 16:41:17 CET 2003


Hi, all:

This is a request for assistance in finding better default parameter
values for bogofilter.

Background:

Bogofilter works by calculating, for each token in a message, a number
between 0 and 1; numbers near 1 indicate that the token occurs more
often in spam, while those near 0 mean it's found more often in
nonspam.  These numbers are combined either by calculating geometric
means or by applying Fisher's method of combining probabilities; the
outcome is a single number, also between 0 and 1, where values near 1
mean bogofilter thinks the message is a spam.
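
For readers who want to see what "combining" means in practice, here is a
minimal Python sketch of the Fisher variant. It follows the formulation
published by Gary Robinson (two chi-square tail probabilities folded into
one indicator). That this matches bogofilter's own code in detail is an
assumption, and the function names are made up for illustration. Scores
must lie strictly between 0 and 1.

```python
import math

def chi2_survival_even(chi, dof):
    """P(X >= chi) for a chi-square variable with an even number of
    degrees of freedom, using the closed-form series."""
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_combine(scores):
    """Combine per-token scores (each strictly between 0 and 1) into one
    number between 0 and 1; values near 1 suggest spam."""
    n = len(scores)
    if n == 0:
        return 0.5  # nothing to go on (an assumption of this sketch)
    chi_high = -2.0 * sum(math.log(f) for f in scores)
    chi_low = -2.0 * sum(math.log(1.0 - f) for f in scores)
    h = chi2_survival_even(chi_high, 2 * n)  # near 1 when scores are mostly high
    s = chi2_survival_even(chi_low, 2 * n)   # near 1 when scores are mostly low
    return (1.0 + h - s) / 2.0
```

A message whose tokens all score 0.5 comes out at exactly 0.5, and the
result moves toward 1 or 0 as the token scores lean consistently one way.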

If b is the number of times a token occurs in spam and bc the number of
spam messages used to count b, and g and gc are the corresponding
numbers for nonspam, then bogofilter calculates

pw = (b/bc) / ((b/bc) + (g/gc))

and if x is the value to be used when b and g are both zero, s is a
weighting factor that applies to x when b and g are small, and n is
the sum b+g, then

fw = ((s * x) + (n * pw))/(s + n)

fw is the per-token value that will be used in the probability
combination.  Bogofilter allows us to ignore values of fw that are
close to 0.5, since these are weak and will normally have little
influence on the classification.  How far an fw value must deviate from
0.5 to be included in the combination step is determined by
bogofilter's mindev parameter.
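
The two formulas and the mindev test can be written out as a short
Python sketch. The function names are hypothetical, bc and gc are assumed
to be greater than zero, and whether a deviation exactly equal to mindev
counts as "far enough" is a guess made for the example.

```python
def token_score(b, bc, g, gc, s, x):
    """Return fw for one token.

    b, g   -- times the token was seen in spam / nonspam
    bc, gc -- number of spam / nonspam messages counted (assumed > 0)
    s, x   -- weighting factor and the value used for unknown tokens
    """
    n = b + g
    if n == 0:
        return x  # token never seen: fw reduces to x
    spam_rate = b / bc
    good_rate = g / gc
    pw = spam_rate / (spam_rate + good_rate)
    return (s * x + n * pw) / (s + n)

def is_significant(fw, mindev):
    """True if fw deviates from 0.5 by at least mindev."""
    return abs(fw - 0.5) >= mindev

# Illustrative numbers only; these are not bogofilter's defaults.
fw = token_score(b=10, bc=100, g=0, gc=100, s=1.0, x=0.5)
keep = is_significant(fw, mindev=0.1)
```

With small b and g the result stays close to x; as n grows, fw approaches
pw, which is the effect of the weighting factor s.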

Clearly, the values of s, x and mindev can significantly influence
bogofilter's classification accuracy.  That's why tools like bogotune
have been developed; there is no one optimum set of s, x and mindev
values that everyone can use, because the optima vary from site to site
depending on the message populations.

After classification, there's one more parameter that needs to be set
properly: the spam cutoff.  Given optimal s, x and mindev values, a
mail administrator can set the value above which a message will be
classified as spam.  Using higher values will give fewer false
positives (nonspam misclassified as spam), at the cost of increasing
the frequency of false negatives (spam misclassified as nonspam or
unsure).  Lower values will have the opposite effect.
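
The trade-off can be seen by counting errors at two different cutoffs.
This is a Python sketch with invented scores; error_rates is a
hypothetical helper, and treating a score equal to the cutoff as spam is
an assumption. Both classes must be present in the input.

```python
def error_rates(scored, cutoff):
    """scored is a list of (score, is_spam) pairs.

    Returns (false_positive_rate, false_negative_rate): the fraction of
    nonspam scoring at or above the cutoff, and the fraction of spam
    scoring below it (that is, classified as nonspam or unsure).
    """
    n_spam = sum(1 for _, is_spam in scored if is_spam)
    n_good = len(scored) - n_spam
    fp = sum(1 for score, is_spam in scored if not is_spam and score >= cutoff)
    fn = sum(1 for score, is_spam in scored if is_spam and score < cutoff)
    return fp / n_good, fn / n_spam

# Invented data, for illustration only.
sample = [(0.99, True), (0.80, True), (0.60, False), (0.10, False)]
low_cutoff = error_rates(sample, 0.5)   # catches more spam, risks false positives
high_cutoff = error_rates(sample, 0.9)  # fewer false positives, more misses
```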

The appeal:

Where does this leave the poor newbie?  We compile default values for
s, x, mindev and spam_cutoff into bogofilter; how did we choose them?

The answer is that they're simply "what worked" for one or another
developer at some time in bogofilter's life.  Whether they work for
anyone else is really quite uncertain; the spam_cutoff value in
particular should be adjusted early in the process of installation and
training, to reflect user priorities (false positives are usually
loathed).

Tuning s, x and mindev with a tool like bogotune is a good thing to do,
except that it needs scads of messages -- several thousand spam and
several thousand nonspam -- to work well.  It follows that we
developers ought to attempt to set default parameters that are likely
to work reasonably well for the majority of new installations, because
the defaults are all that a new user has to go on.  (We often read
uninformed statements that the GM algorithm is "more accurate" than
Fisher, or vice versa; the truth is that each, when both are properly
tuned, is exactly as accurate as the other, but the user has been
fooled because bogofilter's default GM parameters work better for him
or her than the default Fisher ones do, or vice versa.)

David recently suggested that, as part of the cleanup now in progress,
developers should try to address this question of finding good generic
defaults.  The only way I could think of to do that is to accumulate
message-count files from several sources and run them through bogotune,
separately and as aggregates, to see whether the aggregate results are
broadly acceptable.

As a first trial, we took 94,698 nonspams and 59,987 spams;
approximately a third of these were from my personal mail, another
third from a 110-user pool at my workplace, and the rest from David.
All of these were converted to message-count format, both for
efficiency and to preserve privacy.  During this process the three
corresponding training databases were used; message counts ranged from
around 20,000 each of spam and nonspam to around 50,000 each.  The
current version of the original bogotune script, written in R, was run
against this aggregate, and its recommended settings were used to
reclassify all 154,685 messages.

My home mail yielded 0.077% false positives and 15.7% false negatives;
workplace mail, 0.072% false positives and 3.4% false negatives; and
David's mail, 1.07% false positives and 3.2% false negatives.  These
are all much worse than are normally attained on site, with parameters
optimized for the individual populations.  As has been seen in various
studies carried out over the past year, bogofilter has no "one size
fits all" parameter set.

Accepting, then, that the best we can do is likely to be "one size fits
most, tolerably but not perfectly", I now appeal to readers of this
list who have access to significant email archives (not list archives)
of carefully-classified spams and nonspams to provide me with
message-count files.  I need packets of approximately 20,000 each of
recent spam and nonspam, in message-count format, preferably in bzip2
archives containing one file of nonspam message-counts and one file of
spam ones.  If you send me a URL from which to download, I'll do so and
let you know when it's done.

I'm not sure if any message-count converter is supplied with bogofilter
these days, but running a command of the form

    formail -s bogol dbdir <mboxfile >messagecountfile

where dbdir (optional, default ~/.bogofilter) is where your training
database is stored, will do the job if you put this in file "bogol":

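#! /bin/sh
# Usage: bogol [dbdir] <message >counts
# Prints one line per distinct token of the message on stdin:
# the quoted token followed by its two counts from the wordlist.
db=${1:-$HOME/.bogofilter}    # training database directory
( echo .MSG_COUNT; bogolexer -p | sort -u ) | \
    bogoutil -w "$db" | \
    awk 'NF == 3 {printf("\"%s\" %s %s\n", $1, $2, $3)}'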
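
For anyone who wants to check a message-count file before sending it, a
rough Python reader might look like the sketch below. It relies only on
what the script above prints: three fields per line, the first a quoted
token, with a .MSG_COUNT line opening each message. The function name and
the sample lines are invented, and the meaning of the two numeric columns
is deliberately left uninterpreted.

```python
def read_message_counts(lines):
    """Group message-count lines into one record per message."""
    messages = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not a '"token" count count' line
        token = parts[0]
        if len(token) >= 2 and token[0] == '"' and token[-1] == '"':
            token = token[1:-1]  # drop the surrounding quotes
        counts = (int(parts[1]), int(parts[2]))
        if token == ".MSG_COUNT":
            # a new message starts here
            messages.append({"msg_count": counts, "tokens": {}})
        elif messages:
            messages[-1]["tokens"][token] = counts
    return messages

# Invented sample covering two messages.
sample_lines = [
    '".MSG_COUNT" 20000 25000',
    '"hello" 12 340',
    '"viagra" 950 1',
    '".MSG_COUNT" 20000 25000',
    '"meeting" 3 410',
]
parsed = read_message_counts(sample_lines)
```

Counting the records in the result gives a quick check that the file
holds roughly the number of messages you meant to send.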

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |



