bogofilter setup for 50000+ users

Bryan Loniewski brylon at jla.rutgers.edu
Thu Jul 19 20:00:15 CEST 2007


We are looking into running bogofilter for 50,000+ users and would
like to know of any issues with how we are planning on implementing
it.

The interface most users will be using to train their wordlist is a
webmail client (squirrelmail) that will provide a "spam" and "not
spam" button depending on which folder the user is viewing.
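One detail worth sketching for the buttons: if a message was already auto-registered at delivery time (the -u option used in the maildrop recipe further down), a correction has to undo that registration as well as add the new one. bogofilter documents -N and -S for unregistering and -n and -s for registering, so the combined flags would be -Ns and -Sn. The helper name and the button labels below are hypothetical; only the flags and the binary path come from bogofilter and from this message.

```python
# Hypothetical helper: maps a webmail button click to bogofilter arguments.
# The message itself would be fed to bogofilter on standard input.
BOGOFILTER = "/usr/local/bin/bogofilter"  # path as used in the maildrop recipe


def training_args(button, auto_registered=True):
    """Return the argv list for a click on "spam" or "not spam"."""
    if button == "spam":
        # -N removes the earlier ham registration, -s registers as spam.
        flags = "-Ns" if auto_registered else "-s"
    elif button == "not spam":
        # -S removes the earlier spam registration, -n registers as ham.
        flags = "-Sn" if auto_registered else "-n"
    else:
        raise ValueError("unknown button: %r" % (button,))
    return [BOGOFILTER, flags]
```

Without the unregister step, a corrected message would be counted once as ham and once as spam.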

We plan on building each user's initial wordlist using messages from
various sources. For the ham side we are going to use official email
our users receive from various departments within the school (e.g.,
president's office, human resources, etc.) and several different
mailing lists (e.g., unix-admin, pc_lan_admin, etc.). The spam side
will come from a spam folder I've maintained for over a year now that
contains ~18,000 messages. Note that we only plan on training with
the recommended minimum of 2000 ham, 2000 spam. After this is done we
are going to run bogotune and potentially use the recommended
settings.  Once complete we will test our wordlist and settings with
several of our own accounts and see how well we did.
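As a rough sketch of how the 2000-message spam subset could be drawn from the ~18,000-message folder, the following picks a reproducible random sample so that the training set is not biased toward the oldest or newest messages. The file names and the function name are hypothetical placeholders, not part of our setup.

```python
import random


def pick_training_set(paths, size=2000, seed=0):
    """Pick `size` distinct message paths at random, reproducibly."""
    if len(paths) < size:
        raise ValueError("corpus is smaller than the requested training set")
    rng = random.Random(seed)  # fixed seed so the same set can be rebuilt
    return sorted(rng.sample(paths, size))


# Hypothetical file names standing in for the messages in the spam folder.
spam_corpus = ["spam/%05d.eml" % i for i in range(18000)]
training = pick_training_set(spam_corpus)
```

The messages left out of the sample could then serve as a held-out set when testing the resulting wordlist.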

We plan on each user's wordlist being housed on NFS, outside their
$HOME, so their quota is NOT affected by the size of the wordlist. We
are concerned about the size of all wordlists combined, so we plan on
reshuffling and purging old entries periodically (more on this below).

We run maildrop ( http://www.courier-mta.org/maildrop/ ) as our mail
filter/mail delivery agent. Maildrop can read instructions from a
file, which describe how to filter incoming mail (similar to procmail)
and we plan on adding something like the following to the global
/etc/maildroprc file:

xfilter "/usr/local/bin/bogofilter -u -e -p"
if (/^X-Bogosity: Spam, tests=bogofilter/)
{
   to "$HOME/Maildir/.spam"
}
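
For checking the recipe's decision outside of maildrop (for example when testing with our own accounts), the same header test can be sketched in Python. This is only an illustration of the logic above: the function name, the default home directory, and the idea of returning None for "default delivery" are assumptions.

```python
from email import message_from_string


def target_folder(raw_message, home="/home/user"):
    """Mirror the recipe: return the spam folder, or None for default delivery."""
    msg = message_from_string(raw_message)
    verdict = msg.get("X-Bogosity", "")  # header added by bogofilter -p
    if verdict.startswith("Spam, tests=bogofilter"):
        return home + "/Maildir/.spam"
    return None
```

Messages marked Ham or Unsure, and messages with no X-Bogosity header at all, fall through to normal delivery.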

Note that with the above recipe we are configuring bogofilter to
update every user's wordlist automatically. Our logic for this is as
follows:

1) We don't trust our users to keep checking their spam folder
for false positives. If we do NOT update for them, they most likely
will ONLY train their wordlist with false negatives.

2) We plan on reaping entries older than X days (we are leaning
towards X = 90 or 120 days) so that the initial wordlist each user
started with will eventually REALLY become their own wordlist.
Again, if we do NOT update for them, their wordlist may eventually
contain no ham entries at all, making it useless.

We would appreciate any feedback the list can provide. Thanks.

_________________________
Bryan Loniewski
Rutgers University
NBCS - Systems Programmer


