multiple wordlists

David Relson relson at osagesoftware.com
Sat Mar 15 23:29:43 CET 2003


Greetings,

Bogofilter's initial implementation assumed that it would use a pair of 
wordlists, named spamlist.db and goodlist.db, for scoring an email message 
and classifying it as spam or ham.  For increased flexibility, each email 
userid can have its own wordlists.  At first, the environment variables 
$BOGOFILTER_DIR and $HOME were used to name the directory containing the 
wordlists.  As bogofilter evolved, it became possible to specify the 
directory with a command line switch, e.g. "-d bogodir", or with a config 
file option, e.g. "bogofilter_dir=bogodir".  (If more than one of these 
mechanisms is used, the command line has highest precedence, followed by 
the config file, and the environment variable has lowest precedence).

The idea of handling multiple wordlists has been around almost as long as 
bogofilter has existed.  In fact there _is_ some code in bogofilter to 
facilitate working with multiple wordlists.  However, as I have come to 
realize, there are several different ways to approach the question of how 
best to handle multiple wordlists.  In some private discussions, Greg has 
helped to expand and clarify the ideas below.

First, let's assume that there are two sets of wordlists involved and let's 
call them "site" and "user".  "System" and "personal" would be equally good 
names, but I'm going to use the shorter names.

Given that there are two sets of wordlists, how will a token's spamicity be 
calculated?

In the current method, a token is looked up in spamlist.db and goodlist.db 
and given badness and goodness scores.  Each score is the token's count in 
that wordlist divided by the wordlist's message count.  Here's a sample:

              spam   good  p-spam  p-good  spamicity
token        1500   1000  0.2000  0.1000  0.6667
.MSG_COUNT   7500  10000

The formulae are:

	badness = spam(word)/msg_count(spam)
	goodness = good(word)/msg_count(good)
	spamicity = badness/(badness + goodness)

That's how it presently works - with one pair of wordlists.  Let's say the 
above sample is from the site wordlists.  Below is sample data from a 
user's wordlists (which are likely to have smaller counts, at least initially):

              spam   good  p-spam  p-good  spamicity
token          90    240  0.1000  0.2000  0.3333
.MSG_COUNT    900   1200

In case you're wondering, the numbers were selected to have easily computed 
probabilities and spamicities.
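
For anyone who wants to play with the arithmetic, here's a rough sketch of 
the calculation in Python (bogofilter itself is C, so this is illustration 
only), using the sample numbers from both tables:

    def spamicity(spam_count, good_count, spam_msgs, good_msgs):
        badness = spam_count / spam_msgs        # p-spam
        goodness = good_count / good_msgs       # p-good
        return badness / (badness + goodness)

    # site wordlists: 1500 hits in 7500 spam messages, 1000 in 10000 good ones
    print(spamicity(1500, 1000, 7500, 10000))   # 0.6667
    # user wordlists: 90 hits in 900 spam messages, 240 in 1200 good ones
    print(spamicity(90, 240, 900, 1200))        # 0.3333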

Now that the background is established, we get to the subject at hand - how 
to deal with two (possibly more) wordlist pairs.  I'm aware of four ways 
of working with the two lists.

1 - Simply add the numbers to create a combined list.  With the numbers 
above, the combined list would look like:

              spam   good  p-spam  p-good  spamicity
token        1590   1240  0.1893  0.1107  0.6310
.MSG_COUNT   8400  11200

With these numbers, the spamicity score for the one-word message "token" is 
0.6310.
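
Here's a sketch of option 1 (Python again, illustration only): the counts 
are added entry by entry and the usual formula is applied to the sums:

    def merge(a, b):
        # each tuple is (spam_count, good_count, spam_msgs, good_msgs)
        return tuple(x + y for x, y in zip(a, b))

    site = (1500, 1000, 7500, 10000)
    user = (90, 240, 900, 1200)

    spam, good, spam_msgs, good_msgs = merge(site, user)
    badness = spam / spam_msgs              # 1590 / 8400  = 0.1893
    goodness = good / good_msgs             # 1240 / 11200 = 0.1107
    print(badness / (badness + goodness))   # 0.6310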

2 - If "token" is in two lists, use it in the calculation twice, as though 
"site.token" and "user.token" were two separate tokens.  This is comparable 
to scaling values so that the user and site lists have comparable message 
counts.  The numbers are:

                  spam   good  p-spam  p-good  spamicity
site.token       1500   1000  0.2000  0.1000  0.6667
user.token         90    240  0.1000  0.2000  0.3333
site.MSG_COUNT   7500  10000
user.MSG_COUNT    900   1200

With these numbers, the spamicity score for the one-word message "token" is 
0.5000.
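
Here's a sketch of option 2.  The two spamicities are fed into the 
message-level combining step as though they came from two different tokens.  
I've used a Robinson-style geometric-mean combination below just to have 
something concrete; in practice whatever combining algorithm bogofilter is 
configured with would be used:

    import math

    def combine(probs):
        # Robinson-style geometric-mean combination, shown for illustration
        n = len(probs)
        P = 1 - math.prod(1 - p for p in probs) ** (1 / n)
        Q = 1 - math.prod(probs) ** (1 / n)
        return (1 + (P - Q) / (P + Q)) / 2

    site_p = 0.6667   # spamicity of site.token
    user_p = 0.3333   # spamicity of user.token
    print(combine([site_p, user_p]))   # 0.5000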

3 - Give the user list precedence over the site list.  Since the user list 
is (presumably) trained by the user's personal ideas of what is spam and 
what is ham, while the site list is based on a corporate standard (or the 
sysadmin's spam/ham ideas), it makes sense for a user's spam/ham ideas to 
be more important.

In this case, the one-word message "token" would have a spamicity of 0.3333 
(since the value is taken from the user wordlists).
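
Here's a sketch of option 3.  The lookup checks the user wordlists first and 
falls back to the site wordlists only when the user lists have never seen 
the token.  The data layout below is made up, just enough to show the 
lookup order:

    user_lists = {"counts": {"token": (90, 240)},    "msgs": (900, 1200)}
    site_lists = {"counts": {"token": (1500, 1000)}, "msgs": (7500, 10000)}

    def spamicity(token, *wordlists):
        for wl in wordlists:              # first list containing the token wins
            if token in wl["counts"]:
                spam, good = wl["counts"][token]
                spam_msgs, good_msgs = wl["msgs"]
                badness, goodness = spam / spam_msgs, good / good_msgs
                return badness / (badness + goodness)
        return None                       # token not found in any wordlist

    print(spamicity("token", user_lists, site_lists))   # 0.3333 -- user wins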

4 - Weight the two lists, i.e. give higher importance to token values in 
the user list.  The relevant formula would be:

     p(w,weighted) = (W*p(w,user) + p(w,site))/(W+1)

Using a weighting factor would smooth the transition from site to user 
wordlists.  The weighting factor, W, could have an initial value of, say, 
10 and there could be a threshold value so that when the user lists get big 
enough (contain enough messages) the site scores would no longer be used.
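
Here's a sketch of option 4, with the threshold handling left out.  With 
W=10 and the sample values, the weighted probability works out to about 
0.3636, i.e. already close to the user's score:

    def weighted_p(p_user, p_site, W=10):
        # weight the user probability W times as heavily as the site one
        return (W * p_user + p_site) / (W + 1)

    print(weighted_p(0.3333, 0.6667))   # about 0.3636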

Note that there is no particular reason to limit bogofilter to just two 
wordlist pairs.  Conceivably a "standard" spam wordlist could come into 
existence and a site might want to use it in addition to the site wordlists 
and the user wordlists.  Of the options above, with 1 and 2 the time to 
classify a message doubles (or triples) as the number of lists increases to 
2 (or 3).  The speed of option 3 probably doesn't change much as the number 
of lists grows, because the first wordlist pair will do most of the work 
(and will get the training changes).  I don't think option 4 extends well 
beyond two lists.

So, given all this information, the question is "How should bogofilter deal 
with multiple wordlists?"

David




