site then user wordlist
David Relson
relson at osagesoftware.com
Thu Jun 5 03:02:38 CEST 2003
Hello Jason,
At 08:29 PM 6/4/03, Jason Sydes wrote:
>I read up a bit of your discussion regarding multiple wordlists, but it
>didn't seem like you had come to any final decision:
>http://article.gmane.org/gmane.mail.bogofilter.general/2980
I believe that's correct. Bogofilter has code to support multiple
wordlists, but I don't know of anybody actually using it. Also, though
written and included, it needs somebody with a vested interest (such as
you) to create some test messages and wordlists and see if the scoring
results are reasonable (or not).
For a first round of testing, I'd create some simple test messages, use
them to populate site/spamlist.db, site/goodlist.db, user/spamlist.db, and
user/goodlist.db. Since it's a test, I'd create messages with recognizable
tokens like "spam-user-1", "spam-user-2", "spam-site-1", "spam-either",
etc. After training (populating the wordlists), I'd classify the test
messages using bogofilter's "-vvv" switch to see how bogofilter is
combining token scores.
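
As a rough illustration of that first round, here is a minimal Python sketch that only builds the test messages; it does not run bogofilter. The "good-*" token names, the example.com addresses, and the helper names make_test_message and build_messages are assumptions made up for the sketch. Only the "spam-user-1", "spam-user-2", "spam-site-1" and "spam-either" tokens come from the suggestion above.

```python
from email.message import EmailMessage

# Token sets per wordlist. The "good-*" names are hypothetical; they simply
# mirror the "spam-*" tokens suggested in the message.
TEST_SETS = {
    "site-spam": ["spam-site-1", "spam-site-2", "spam-either"],
    "site-good": ["good-site-1", "good-site-2", "good-either"],
    "user-spam": ["spam-user-1", "spam-user-2", "spam-either"],
    "user-good": ["good-user-1", "good-user-2", "good-either"],
}


def make_test_message(label, tokens):
    """Return a small RFC 822 message whose body is just the given tokens."""
    msg = EmailMessage()
    msg["From"] = "tester@example.com"  # placeholder address
    msg["To"] = "tester@example.com"    # placeholder address
    msg["Subject"] = f"wordlist test {label}"
    msg.set_content(" ".join(tokens) + "\n")
    return msg.as_string()


def build_messages():
    """Build one message per wordlist, keyed by a label such as 'user-spam'."""
    return {label: make_test_message(label, tokens)
            for label, tokens in TEST_SETS.items()}
```

Each message would then be registered into the matching wordlist and later classified with "-vvv"; because every token names the list it belongs to, it should be easy to see in the verbose output which list contributed which count.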
Additional wordlists can be specified in the config file using the wordlist
directive. The directive includes an "override" value which may be useful
to you. The list with the highest override number is scanned first. If the
token is found, then lists with lower override numbers aren't looked
at. If all lists have the same override value, then counts are
accumulated before probabilities are computed. See wordlists.c for
the initialization code and robinson.c for the probability computations.
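
The override rule described above can be sketched in Python as follows. This is a toy model of the described behavior, not the code in wordlists.c or robinson.c: the function name lookup_counts and the dict layout are hypothetical, and the handling of a mix of equal and unequal override values (accumulate within one level, stop at the first level where the token is found) is an assumption.

```python
from itertools import groupby


def lookup_counts(token, wordlists):
    """Return (spam_count, good_count) for a token.

    Each wordlist is a dict with an integer 'override' and two dicts,
    'spam' and 'good', mapping token -> count (hypothetical layout).
    """
    # Scan lists with the highest override number first.
    ordered = sorted(wordlists, key=lambda wl: wl["override"], reverse=True)
    for _, level in groupby(ordered, key=lambda wl: wl["override"]):
        spam = good = 0
        found = False
        # Lists sharing an override value have their counts accumulated.
        for wl in level:
            if token in wl["spam"] or token in wl["good"]:
                found = True
                spam += wl["spam"].get(token, 0)
                good += wl["good"].get(token, 0)
        # Once the token is found, lower override levels are not consulted.
        if found:
            return spam, good
    return 0, 0
```

With a user list at override 2 and a site list at override 1, a token present in both is scored from the user counts alone, while a token present only in the site list falls through to the site counts. Giving both lists the same override value sums the counts instead.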
If you encounter problems, I'll be glad to help. Warning: I won't be
available for the next week or so as I'll be away and without a computer.
>We're considering deploying bogofilter using both a site and user
>wordlist. Our system processes about 600,000 mail messages per day.
>The idea behind an initial site list is to reduce the initial training
>time for customers. Our good|spamlist.db's will be generated from the
>email of about five of our employees.
>
>We want to avoid using bogofilter -u on the site wordlist, as we have
>several mail servers mounting data over NFS, and are also concerned
>about how large a word list generated from this many users could grow.
>
>Therefore, we're hoping to classify mail using both the site and user
>wordlists, and then train just the user list based upon the result from
>the classification (the procmail filter is shown below).
>
>My question? I'm uncertain if I might be upsetting the combinatorial
>gods by using two sets of probabilities to train a single set of
>probabilities. Do you see any problems inherent in this approach?
The obvious problem is conflicting information in the user and site
wordlists. Also we've seen that different people have different
definitions of what's spam. One user may want information about marketing
methods while another user may not want any sort of marketing
information. I'm sure as you get deeper into this, you'll learn much that
others do not yet know. I hope you'll share insights and discoveries with us.
David