bogofilter setup for 50000+ users

Bryan Loniewski brylon at jla.rutgers.edu
Fri Jul 20 22:02:37 CEST 2007


On Thu, 19 Jul 2007, David Relson wrote:

> On Thu, 19 Jul 2007 14:00:15 -0400 (EDT)
> Bryan Loniewski wrote:
>
>> We are looking into running bogofilter for 50,000+ users and would
>> like to know of any issues with how we are planning on implementing
>> it.
>>
>> The interface most users will be using to train their wordlist is a
>> webmail client (squirrelmail) that will provide a "spam" and "not
>> spam" button depending on which folder the user is viewing.
>>
>> We plan on building each user's initial wordlist using messages from
>> various sources. For the ham side we are going to use official email
>> our users receive from various departments within the school (e.g.,
>> president's office, human resources, etc.) and several different
>> mailing-lists (e.g., unix-admin, pc_lan_admin, etc.). The spam side
>> will come from a spam folder I've maintained for over a year now that
>> contains ~18000 messages. Note that we only plan on training with
>> the recommended minimum of 2000 ham, 2000 spam. After this is done we
>> are going to run bogotune and potentially use the recommended
>> settings.  Once complete we will test our wordlist and settings with
>> several of our own accounts and see how well we did.
>>
>> We plan on each user's wordlist being housed on NFS, outside their
>> $HOME, so their quota is NOT affected by the size of the wordlist. We
>> are concerned about the size of all wordlists combined, so we plan on
>> reshuffling and purging old entries periodically (more on this below).
>>
>> We run maildrop ( http://www.courier-mta.org/maildrop/ ) as our mail
>> filter/mail delivery agent. Maildrop can read instructions from a
>> file, which describe how to filter incoming mail (similar to procmail)
>> and we plan on adding something like the following to the global
>> /etc/maildroprc file:
>>
>> xfilter "/usr/local/bin/bogofilter -u -e -p"
>> if (/^X-Bogosity: Spam, tests=bogofilter/)
>> {
>>    to "$HOME/Maildir/.spam"
>> }
>>
>> Note that with the above recipe we are configuring bogofilter to
>> update every user's wordlist automatically. Our logic for this is as
>> follows:
>>
>> 1) We don't trust our users to keep checking their spam folder
>> for false positives, therefore, if we do NOT update for them they most
>> likely will ONLY train their wordlist with false negatives.
>>
>> 2) We plan on reaping entries older than X days (we are leaning
>> towards X = 90/120 days) so that our initial wordlist each user
>> started off with will eventually REALLY become their own wordlist.
>> Again if we do NOT update for them, at some point in time they may
>> never have any ham messages, thus making their wordlist useless.
>>
>> We would appreciate any feedback the list can provide. Thanks.
>>
>> _________________________
>> Bryan Loniewski
>> Rutgers University
>> NBCS - Systems Programmer
>
> Hello Bryan,
>
> Sounds like you've given significant thought to this.  Good!

We are trying ;)

>
> A couple of questions:
>
> What OS will squirrelmail be running on?  What database will bogofilter
> be using?

We run squirrelmail on Solaris 9 boxes. We planned on using Berkeley
DB (4.2.52).

>
> In the past there have been some problems with NFS locking not working
> properly in all environments.  This is relevant when using Berkeley DB,
> but does not affect sqlite3.

So would you recommend we use sqlite3? Do we lose/gain anything by
using sqlite3 vs. Berkeley?

>
> I'm sure other folks with more detailed networking knowledge will
> contribute their thoughts.
>
> HTH,
>
> David
>
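
Illustration of point 2 in the original proposal (reaping entries older
than X days while users train only on false negatives). This is a
minimal sketch under stated assumptions, not bogofilter's implementation:
the in-memory dictionary, the function names (register, reap, ham_total,
simulate) and the sample tokens are all hypothetical, and the only thing
taken from the thread is the idea of dropping entries not seen within the
last X days (X = 90 here).

```python
from datetime import date, timedelta

# Hypothetical stand-in for one user's wordlist (NOT bogofilter's format):
# token -> (spam_count, ham_count, last_seen_date)

def register(wordlist, tokens, is_spam, today):
    """Count each distinct token once for this message and refresh its date."""
    for tok in set(tokens):
        spam, ham, _ = wordlist.get(tok, (0, 0, today))
        if is_spam:
            spam += 1
        else:
            ham += 1
        wordlist[tok] = (spam, ham, today)

def reap(wordlist, today, max_age_days):
    """Return a copy without tokens last seen more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return {tok: rec for tok, rec in wordlist.items() if rec[2] >= cutoff}

def ham_total(wordlist):
    """Sum of ham counts over all remaining tokens."""
    return sum(ham for _, ham, _ in wordlist.values())

def simulate(auto_update):
    """Seed a wordlist, train 100 days later, reap at day 120 with X = 90."""
    start = date(2007, 7, 20)
    wl = {}
    # Initial wordlist built by the administrators (sample tokens only).
    register(wl, ["meeting", "syllabus"], is_spam=False, today=start)
    register(wl, ["viagra"], is_spam=True, today=start)

    later = start + timedelta(days=100)
    # The user presses "spam" on a false negative.
    register(wl, ["pills"], is_spam=True, today=later)
    if auto_update:
        # Assumed effect of automatic updating: delivered ham is registered too.
        register(wl, ["meeting", "agenda"], is_spam=False, today=later)

    wl = reap(wl, today=start + timedelta(days=120), max_age_days=90)
    return ham_total(wl)

if __name__ == "__main__":
    print("ham count without auto-update:", simulate(False))
    print("ham count with auto-update:   ", simulate(True))
```

In this toy model, the run without automatic updating ends with no ham
entries at all once the seed data has aged out, which is the situation
the proposal wants to avoid. With automatic updating, the ham tokens that
keep appearing in delivered mail survive the purge.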


