bogofilter setup for 50000+ users

David Relson relson at osagesoftware.com
Fri Jul 20 01:05:02 CEST 2007


On Thu, 19 Jul 2007 14:00:15 -0400 (EDT)
Bryan Loniewski wrote:

> We are looking into running bogofilter for 50,000+ users and would
> like to know of any issues with how we are planning on implementing
> it.
> 
> The interface most users will be using to train their wordlist is a
> webmail client (squirrelmail) that will provide a "spam" and "not
> spam" button depending on which folder the user is viewing.
> 
> We plan on building each user's initial wordlist using messages from
> various sources. For the ham side we are going to use official email
> our users receive from various departments within the school (e.g.,
> president's office, human resources, etc.) and several different
> mailing lists (e.g., unix-admin, pc_lan_admin, etc.). The spam side
> will come from a spam folder I've maintained for over a year now that
> contains ~18000 messages. Note that we only plan on training with
> the recommended minimum of 2000 ham, 2000 spam. After this is done we
> are going to run bogotune and potentially use the recommended
> settings.  Once complete we will test our wordlist and settings with
> several of our own accounts and see how well we did.
> 
> We plan on each user's wordlist being housed on NFS, outside their
> $HOME, so their quota is NOT affected by the size of the wordlist. We
> are concerned about the size of all wordlists combined, so we plan on
> reshuffling and purging old entries periodically (more on this below).
> 
> We run maildrop ( http://www.courier-mta.org/maildrop/ ) as our mail
> filter/mail delivery agent. Maildrop can read instructions from a
> file, which describe how to filter incoming mail (similar to procmail)
> and we plan on adding something like the following to the global
> /etc/maildroprc file:
> 
> xfilter "/usr/local/bin/bogofilter -u -e -p"
> if (/^X-Bogosity: Spam, tests=bogofilter/)
> {
>    to "$HOME/Maildir/.spam"
> }
> 
> Note that with the above recipe we are configuring bogofilter to
> update every user's wordlist automatically. Our logic for this is as
> follows:
> 
> 1) We don't trust our users to keep checking their spam folder
> for false positives. Therefore, if we do NOT update for them, they most
> likely will ONLY train their wordlist with false negatives.
> 
> 2) We plan on reaping entries older than X days (we are leaning
> towards X = 90/120 days) so that our initial wordlist each user
> started off with will eventually REALLY become their own wordlist.
> Again, if we do NOT update for them, at some point in time they may
> never have any ham messages, thus making their wordlist useless.
> 
> We would appreciate any feedback the list can provide. Thanks.
> 
> _________________________
> Bryan Loniewski
> Rutgers University
> NBCS - Systems Programmer

Hello Bryan,

Sounds like you've given significant thought to this.  Good!

A couple of questions:  

What OS will squirrelmail be running on?  What database will bogofilter
be using? 

In the past there have been some problems with NFS locking not working
properly in all environments.  This is relevant when using Berkeley DB,
but does not affect sqlite3.
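
As a rough first check of the locking concern, a small script can try to take a lock on a scratch file in the directory where the wordlists would live. This is only a sketch: it assumes the database layer relies on POSIX fcntl locks, and the function name can_lock is made up for this illustration. A success on one host does not show that locks are honoured between different NFS clients, so it would need to be run from two machines against the same export to mean much.

```python
import fcntl
import os
import sys
import tempfile


def can_lock(directory):
    """Return True if an exclusive, non-blocking fcntl lock can be
    taken on a scratch file created in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory, prefix="locktest-")
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of hanging
            # if the lock cannot be granted.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always clean up the scratch file, whatever the outcome.
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Directory to probe; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(target, "lock ok" if can_lock(target) else "lock FAILED")
```

Run it with the wordlist directory as the only argument. A failure here is a strong hint that the mount options or the lock daemon need attention before Berkeley DB is put on that filesystem.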

I'm sure other folks with more detailed networking knowledge will
contribute their thoughts.

HTH,

David


