site then user wordlist

Thu Jun 5 03:02:38 CEST 2003

Hello Jason,

At 08:29 PM 6/4/03, Jason Sydes wrote:
>I read up a bit of your discussion regarding multiple wordlists, but it
>didn't seem like you had come to any final decision:
>http://article.gmane.org/gmane.mail.bogofilter.general/2980

I believe that's correct.  Bogofilter has code to support multiple
wordlists, but I don't know of anybody actually using it.  The code is
written and included, but it still needs somebody with a vested interest
(such as you) to create some test messages and wordlists and see whether
the scoring results are reasonable.

For a first round of testing, I'd create some simple test messages and use
them to populate site/spamlist.db, site/goodlist.db, user/spamlist.db, and
user/goodlist.db.  Since it's a test, I'd create messages with recognizable 
tokens like "spam-user-1", "spam-user-2", "spam-site-1", "spam-either", 
etc.  After training (populating the wordlists), I'd classify the test 
messages using bogofilter's "-vvv" switch to see how bogofilter is 
combining token scores.
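
As a rough sketch of how such test messages could be generated (Python here,
purely for illustration): the token names come from the suggestion above, but
the grouping into "site-spam" and "user-spam", the addresses, and the helper
make_message are assumptions of mine.  Messages for the good lists would
follow the same pattern with their own marker tokens.

```python
# Hypothetical plan: which marker tokens go into which training message.
PLAN = {
    "site-spam": ["spam-site-1", "spam-either"],
    "user-spam": ["spam-user-1", "spam-user-2", "spam-either"],
}


def make_message(name, tokens):
    """Build a minimal mail message whose body is just the marker tokens."""
    headers = [
        "From: tester@example.com",
        "To: tester@example.com",
        "Subject: test message " + name,
    ]
    # A blank line separates headers from body.
    return "\n".join(headers) + "\n\n" + " ".join(tokens) + "\n"


# One message per planned wordlist, ready to be written to files and
# fed to bogofilter for training.
messages = {name: make_message(name, toks) for name, toks in PLAN.items()}
```

Because "spam-either" appears in both groups, it is the token to watch in the
verbose output when checking how the site and user lists are combined.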

Additional wordlists can be specified in the config file using the wordlist 
directive.  The directive includes an "override" value which may be useful 
to you.  The list with the highest override number is scanned first.  If the
token is found there, lists with lower override numbers aren't looked
at.  If all lists have the same override value, the counts are
accumulated before probabilities are computed.  See wordlists.c for
the initialization code and robinson.c for the probability computations.
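
To make the override rule concrete, here is a small Python model of the
lookup as described in the paragraph above.  It is only a sketch, not the
code in wordlists.c: the Wordlist tuple, the lookup function, and the
(spam, good) count pairs are illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical in-memory stand-in for a wordlist; counts maps a token to
# a (spam_count, good_count) pair.  Not bogofilter's real data structure.
Wordlist = namedtuple("Wordlist", ["name", "override", "counts"])


def lookup(token, wordlists):
    """Return accumulated (spam, good) counts for token.

    Lists are scanned from highest override to lowest.  Once the token is
    found, lists with a lower override are skipped; lists sharing the same
    override value have their counts added together.
    """
    spam = good = 0
    found_override = None
    for wl in sorted(wordlists, key=lambda w: w.override, reverse=True):
        if found_override is not None and wl.override < found_override:
            break  # token already found in a higher-override list
        if token in wl.counts:
            s, g = wl.counts[token]
            spam += s
            good += g
            found_override = wl.override
    return spam, good


site = Wordlist("site", 1, {"spam-site-1": (5, 0), "spam-either": (3, 1)})
user = Wordlist("user", 2, {"spam-user-1": (4, 0), "spam-either": (1, 2)})

print(lookup("spam-either", [site, user]))   # (1, 2): user list wins
print(lookup("spam-site-1", [site, user]))   # (5, 0): falls through to site
```

Giving both lists the same override value in this model makes the counts for
"spam-either" add up to (4, 3), which is the accumulate-then-compute case.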

If you encounter problems, I'll be glad to help.  Warning:  I won't be
available for the next week or so as I'll be away and computerless.

>We're considering deploying bogofilter using both a site and user
>wordlist.  Our system processes about 600,000 mail messages per day.
>The idea behind an initial site list is to reduce the initial training
>time for customers.  Our good|spamlist.db's will be generated from the
>email of about five of our employees.
>
>We want to avoid using bogofilter -u on the site wordlist, as we have
>several mail servers mounting data over NFS, and are also concerned
>about how large the word list generated from this many users could grow.
>
>Therefore, we're hoping to classify mail using both the site and user
>wordlists, and then train just the user list based upon the result from
>the classification (the procmail filter is shown below).
>
>My question?  I'm uncertain if I might be upsetting the combinatorial
>gods by using two sets of probabilities to train a single set of
>probabilities.  Do you see any problems inherent in this approach?

The obvious problem is conflicting information in the user and site 
wordlists.  Also we've seen that different people have different 
definitions of what's spam.  One user may want information about marketing 
methods while another user may not want any sort of marketing 
information.  I'm sure as you get deeper into this, you'll learn much that 
others do not yet know.  I hope you'll share insights and discoveries with us.

David