multiple wordlists
David Relson
relson at osagesoftware.com
Sat Mar 15 23:29:43 CET 2003
Greetings,
Bogofilter's initial implementation assumed that it would use a pair of
wordlists, named spamlist.db and goodlist.db, for scoring an email message
and classifying it as spam or ham. For increased flexibility, each email
userid can have its own wordlists. Initially, the environment variables
$BOGOFILTER_DIR and $HOME were used to name the directory with the
wordlists. As bogofilter evolved, it became possible to specify the
directory with a command line switch, e.g. "-d bogodir", or with a config
file option, e.g. "bogofilter_dir=bogodir". (If more than one of these
mechanisms is used, the command line has highest precedence, followed by
the config file, and the environment variables have lowest precedence.)
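The precedence rule can be sketched in a few lines of Python. This is only an illustration, not bogofilter's actual code: the function name resolve_wordlist_dir and its parameters are made up, and checking $BOGOFILTER_DIR before $HOME is an assumption, since the text above does not say which of the two variables wins.

```python
import os


def resolve_wordlist_dir(cli_dir=None, config_dir=None, environ=None):
    """Pick the wordlist directory: command line, then config file, then environment.

    Hypothetical helper for illustration only.
    """
    env = os.environ if environ is None else environ
    candidates = [
        cli_dir,                    # e.g. "-d bogodir" on the command line
        config_dir,                 # e.g. "bogofilter_dir=bogodir" in the config file
        env.get("BOGOFILTER_DIR"),  # environment variables come last
        env.get("HOME"),            # assumed to be checked after BOGOFILTER_DIR
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return None


print(resolve_wordlist_dir("cli", "cfg", {"BOGOFILTER_DIR": "env"}))
```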
The idea of handling multiple wordlists has been around almost as long as
bogofilter has existed. In fact there _is_ some code in bogofilter to
facilitate working with multiple wordlists. However, as I have come to
realize, there are several different ways to approach the question of how
best to handle multiple wordlists. In some private discussions, Greg has
helped to expand and clarify the ideas below.
First, let's assume that there are two sets of wordlists involved and let's
call them "site" and "user". "System" and "personal" would be equally good
names, but I'm going to use the shorter names.
Given that there are two sets of wordlists, how will a token's spamicity be
calculated?
In the current method, a token is looked up in spamlist.db and goodlist.db
and is given goodness and badness scores. The scores are computed by
taking the count the token has in each wordlist and dividing that count by
the number of messages in the wordlist. Here's a sample:
             spam    good   p-spam  p-good  spamicity
token        1500    1000   0.2000  0.1000  0.6667
.MSG_COUNT   7500   10000
The formulae are:
badness = spam(word)/msg_count(spam)
goodness = good(word)/msg_count(good)
spamicity = badness/(badness + goodness)
That's how it presently works - with 1 pair of wordlists. Let's say the
above sample is from the site wordlists. Below is sample data from a
user's wordlists (which are likely to have smaller counts, at least initially):
             spam    good   p-spam  p-good  spamicity
token          90     240   0.1000  0.2000  0.3333
.MSG_COUNT    900    1200
In case you're wondering, the numbers were selected to have easily computed
probabilities and spamicities.
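For anyone who wants to check the arithmetic, the three formulae translate directly into Python. The function name spamicity and its argument names are chosen here for illustration; this is a sketch of the calculation, not code taken from bogofilter.

```python
def spamicity(spam_count, good_count, spam_msgs, good_msgs):
    """Apply the badness/goodness/spamicity formulae to one token."""
    badness = spam_count / spam_msgs    # spam(word)/msg_count(spam)
    goodness = good_count / good_msgs   # good(word)/msg_count(good)
    return badness / (badness + goodness)


site = spamicity(1500, 1000, 7500, 10000)
user = spamicity(90, 240, 900, 1200)
print(f"site {site:.4f}  user {user:.4f}")
```

The sketch does not handle a token that is absent from both lists (both counts zero), which would divide by zero.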
Now that the background is established, we get to the subject at hand - how
to deal with two (possibly more) wordlist pairs. I'm aware of four ways
of working with the two lists.
1 - Simply add the numbers to create a combined list. With the numbers
above, the combined list would look like:
             spam    good   p-spam  p-good  spamicity
token        1590    1240   0.1893  0.1107  0.6310
.MSG_COUNT   8400   11200
With these numbers, the spamicity score for the one word message "token" is
0.6310
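A minimal sketch of option 1, assuming each wordlist pair is represented as a tuple (spam count, good count, spam message count, good message count). That representation and the function names are invented for this example.

```python
def spamicity(spam_count, good_count, spam_msgs, good_msgs):
    badness = spam_count / spam_msgs
    goodness = good_count / good_msgs
    return badness / (badness + goodness)


def combined_spamicity(pairs):
    """Option 1: sum token counts and message counts over all wordlist pairs."""
    spam_count = sum(p[0] for p in pairs)
    good_count = sum(p[1] for p in pairs)
    spam_msgs = sum(p[2] for p in pairs)
    good_msgs = sum(p[3] for p in pairs)
    return spamicity(spam_count, good_count, spam_msgs, good_msgs)


site = (1500, 1000, 7500, 10000)
user = (90, 240, 900, 1200)
print(f"{combined_spamicity([site, user]):.4f}")
```

Because the site counts are so much larger, the combined value stays close to the site value of 0.6667.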
2 - If "token" is in two lists, use it in the calculation twice, as though
"site.token" and "user.token" were two separate tokens. This is comparable
to scaling values so that the user and site lists have comparable message
counts. The numbers are:
                 spam    good   p-spam  p-good  spamicity
site.token       1500    1000   0.2000  0.1000  0.6667
user.token         90     240   0.1000  0.2000  0.3333
site.MSG_COUNT   7500   10000
user.MSG_COUNT    900    1200
With these numbers, the spamicity score for the one word message "token" is
0.5000
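To see where 0.5000 comes from, here is a sketch that treats the two spamicities as independent tokens. The combining rule used (product of p over product of p plus product of 1-p, the Graham-style naive Bayes combination) is an assumption made for this example; the message does not say which combining method is meant. Since 0.6667 and 0.3333 are symmetric around 0.5, any symmetric rule should land on the same result.

```python
from math import prod


def spamicity(spam_count, good_count, spam_msgs, good_msgs):
    badness = spam_count / spam_msgs
    goodness = good_count / good_msgs
    return badness / (badness + goodness)


def combine(probabilities):
    """Assumed combining rule: prod(p) / (prod(p) + prod(1 - p))."""
    p_spam = prod(probabilities)
    p_ham = prod(1.0 - p for p in probabilities)
    return p_spam / (p_spam + p_ham)


site_token = spamicity(1500, 1000, 7500, 10000)   # "site.token"
user_token = spamicity(90, 240, 900, 1200)        # "user.token"
score = combine([site_token, user_token])
print(f"{score:.4f}")
```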
3 - Give the user list precedence over the site list. Since the user list
is (presumably) trained by the user's personal ideas of what is spam and
what is ham, while the site list is based on a corporate standard (or the
sysadmin's spam/ham ideas), it makes sense for a user's spam/ham ideas to
be more important.
In this case, the one word message "token" would have a spamicity of 0.3333
(since the value is taken from the user wordlists).
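One way to read option 3 is "first list that knows the token wins". The sketch below assumes each wordlist pair is a dict mapping a token to (spam count, good count), with the message counts stored under ".MSG_COUNT", and assumes that a token missing from the user list falls through to the site list. Both the data layout and the fall-through behaviour are assumptions made for illustration.

```python
def spamicity(spam_count, good_count, spam_msgs, good_msgs):
    badness = spam_count / spam_msgs
    goodness = good_count / good_msgs
    return badness / (badness + goodness)


def lookup_with_precedence(token, wordlists):
    """Option 3: use the first wordlist pair (in precedence order) that has the token."""
    for wordlist in wordlists:
        if token in wordlist:
            spam_count, good_count = wordlist[token]
            spam_msgs, good_msgs = wordlist[".MSG_COUNT"]
            return spamicity(spam_count, good_count, spam_msgs, good_msgs)
    return None  # token unknown to every list


user = {"token": (90, 240), ".MSG_COUNT": (900, 1200)}
site = {"token": (1500, 1000), "other": (1500, 1000), ".MSG_COUNT": (7500, 10000)}

print(f"{lookup_with_precedence('token', [user, site]):.4f}")   # user value
print(f"{lookup_with_precedence('other', [user, site]):.4f}")   # falls back to site
```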
4 - Weight the two lists, i.e. give higher importance to token values in
the user list. The relevant formula would be:
p(w,weighted) = (W*p(w,user) + p(w,site))/(W+1)
Using a weighting factor would smooth the transition from site to user
wordlists. The weighting factor, W, could have an initial value of, say,
10 and there could be a threshold value so that when the user lists get big
enough (contain enough messages) the site scores would no longer be used.
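A sketch of option 4, applying the formula to the spamicity values from the samples above. Treating p(w,user) and p(w,site) as spamicities is an assumption, as are the parameter names and the threshold of 2000 messages used in the second call; no threshold value is proposed in this message.

```python
def weighted_probability(p_user, p_site, weight=10.0, user_msgs=0, threshold=None):
    """Option 4: (W*p(w,user) + p(w,site)) / (W+1), with an optional cutoff.

    Once the user lists hold at least `threshold` messages (hypothetical
    parameter), the site value is ignored entirely.
    """
    if threshold is not None and user_msgs >= threshold:
        return p_user
    return (weight * p_user + p_site) / (weight + 1.0)


p_user = 1.0 / 3.0   # user spamicity from the sample data
p_site = 2.0 / 3.0   # site spamicity from the sample data

blended = weighted_probability(p_user, p_site)                      # W = 10
user_only = weighted_probability(p_user, p_site, user_msgs=2100, threshold=2000)
print(f"{blended:.4f} {user_only:.4f}")
```

With W = 10 the blended value works out to 4/11, about 0.3636, so the user lists dominate but the site lists still pull the score slightly upward.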
Note that there is no particular reason to limit bogofilter to just two
wordlist pairs. Conceivably a "standard" spam wordlist could come into
existence and a site might want to use it in addition to the site wordlists
and the user wordlists. With options 1 and 2, the time to classify a
message doubles (or triples) as the number of lists increases to 2 (or 3).
The speed of option 3 probably doesn't change much as the number of lists
grows because the first wordlist pair will do most of the work (and will
get the training changes). I think that option 4 doesn't extend well
beyond two lists.
So, given all this information, the question is "How should bogofilter deal
with multiple wordlists?"
David