site then user wordlist

Jason Sydes jason at hq.newdream.net
Thu Jun 5 02:29:22 CEST 2003


I read through some of your discussion regarding multiple wordlists,
but it didn't seem like you had come to any final decision:
http://article.gmane.org/gmane.mail.bogofilter.general/2980

We're considering deploying bogofilter using both a site and a user
wordlist.  Our system processes about 600,000 mail messages per day.
The idea behind an initial site list is to reduce the initial training
time for customers.  Our goodlist.db and spamlist.db will be generated
from the email of about five of our employees.

We want to avoid using bogofilter -u on the site wordlist, as we have
several mail servers mounting data over NFS, and we are also concerned
about how large a wordlist trained on mail from this many users could
grow.

Therefore, we're hoping to classify mail using both the site and user 
wordlists, and then train just the user list based upon the result from 
the classification (the procmail filter is shown below).  

My question: I'm uncertain whether I might be upsetting the
combinatorial gods by using two sets of probabilities to train a single
set of probabilities.  Do you see any problems inherent in this approach?
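
To make the concern concrete, here is a toy sketch of the feedback loop
in Python.  It is not bogofilter's actual scoring algorithm: summing
counts across the two lists, the naive mean of token probabilities, the
0.5 cutoff, and all function names are assumptions made purely for
illustration.  It shows how a token the site list has never seen can
pick up a strong spam score in the combined view after one round of
self-training on the user list.

```python
from collections import Counter

def new_list():
    # Hypothetical wordlist: per-token message counts plus message totals.
    return {"spam": Counter(), "good": Counter(), "nspam": 0, "ngood": 0}

def train(wordlist, tokens, is_spam):
    # Register one message in a single wordlist.
    key, total = ("spam", "nspam") if is_spam else ("good", "ngood")
    wordlist[key].update(set(tokens))
    wordlist[total] += 1

def token_spam_prob(token, lists):
    # Assumption of this sketch: counts from all wordlists are summed.
    spam = sum(wl["spam"][token] for wl in lists)
    good = sum(wl["good"][token] for wl in lists)
    nspam = sum(wl["nspam"] for wl in lists)
    ngood = sum(wl["ngood"] for wl in lists)
    spam_rate = spam / nspam if nspam else 0.0
    good_rate = good / ngood if ngood else 0.0
    if spam_rate + good_rate == 0:
        return 0.5  # unseen token: treat as neutral
    return spam_rate / (spam_rate + good_rate)

def classify(tokens, lists, cutoff=0.5):
    # Naive mean of token probabilities, standing in for a real combiner.
    probs = [token_spam_prob(t, lists) for t in set(tokens)]
    if not probs:
        return False
    return sum(probs) / len(probs) > cutoff

# Site list trained by hand; user list starts empty.
site, user = new_list(), new_list()
train(site, ["viagra", "offer"], True)
train(site, ["meeting", "agenda"], False)

msg = ["viagra", "offer", "today"]
before = token_spam_prob("today", [site, user])
verdict = classify(msg, [site, user])   # decided using both lists
train(user, msg, verdict)               # but only the user list learns
after = token_spam_prob("today", [site, user])
print(before, verdict, after)
```

In this toy model the word "today" starts out neutral and, after a
single self-trained message, counts fully toward spam, even though no
human ever judged it.  Whether a real combining method amplifies its own
verdicts this way is exactly the open question.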


# Procmail fun
# Rate and tag message using both site and user lists.
:0fw
| bogofilter -e -p -c /etc/bogofilter_global.cf

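# Update only the personal list according to the result of the test above.
# The "c" flag hands bogofilter a copy, so the message itself still
# falls through to normal delivery.

   # Message is spam, update user spamlist.db
   :0c
   * ^X-Bogosity: Yes
   | bogofilter -s

   # Message is ham, update user goodlist.db
   :0c
   * ^X-Bogosity: No
   | bogofilter -n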


Regards,
Jason




