bogofilter setup for 50000+ users

dave-bogofilter at homer.cymry.org
Fri Jul 20 23:44:35 CEST 2007


Hi Bryan

Always nice to hear about large bogofilter installs. 

Although this probably sounds like crazy talk, you might consider
trying a single wordlist for your entire user population.  We did this
for a population of around 2500 users with a very early version of
bogofilter, with mind-blowing results. See our implementation paper
(including the training scripts we used, etc.) here:
http://www.usenix.org/events/lisa04/tech/blosser.html
There's anecdotal evidence that this has scaled well to environments of
your size.

I suggest this simply because, in my experience, bogofilter's
decision-making improves linearly with the amount of data it's given. A
single user's wordlist probably doesn't hold a lot of data, but a
50,000-user wordlist is probably as near to perfect as you can get, IMO.

Of course that complicates the training arrangement you have in mind. 
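
As a rough sketch only, a shared-wordlist version of your recipe might
look something like the following. The directory /var/spool/bogofilter
is a made-up path, and I'm assuming the same bogofilter binary location
you quoted; -d just tells bogofilter which directory holds the wordlist.

    # Hypothetical shared wordlist directory (an assumption, pick your own).
    BOGODIR="/var/spool/bogofilter"

    # Same flags as your recipe, plus -d to use the one shared wordlist.
    xfilter "/usr/local/bin/bogofilter -d $BOGODIR -u -e -p"
    if (/^X-Bogosity: Spam, tests=bogofilter/)
    {
       to "$HOME/Maildir/.spam"
    }

With -u in play every message is already registered at delivery time,
so the webmail buttons would have to correct a registration rather than
add one: something like "bogofilter -d $BOGODIR -Ns" for a missed spam
and "-Sn" for a false positive, with the message on stdin. Any one
user's correction then shifts the scores for everybody, which is the
complication I mean.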

--dave josephsen.

On Fri, Jul 20, 2007 at 04:02:37PM -0400, Bryan Loniewski wrote:
> On Thu, 19 Jul 2007, David Relson wrote:
> 
> > On Thu, 19 Jul 2007 14:00:15 -0400 (EDT)
> > Bryan Loniewski wrote:
> >
> >> We are looking into running bogofilter for 50,000+ users and would
> >> like to know of any issues with how we are planning on implementing
> >> it.
> >>
> >> The interface most users will be using to train their wordlist is a
> >> webmail client (squirrelmail) that will provide a "spam" and "not
> >> spam" button depending on which folder the user is viewing.
> >>
> >> We plan on building each user's initial wordlist using messages from
> >> various sources. For the ham side we are going to use official email
> >> our users receive from various departments within the school (e.g.,
> >> president's office, human resources, etc.) and several different
> >> mailing lists (e.g., unix-admin, pc_lan_admin, etc.). The spam side
> >> will come from a spam folder I've maintained for over a year now that
> >> contains ~18000 messages. Note that we only plan on training with
> >> the recommended minimum of 2000 ham, 2000 spam. After this is done we
> >> are going to run bogotune and potentially use the recommended
> >> settings.  Once complete we will test our wordlist and settings with
> >> several of our own accounts and see how well we did.
> >>
> >> We plan on each user's wordlist being housed on NFS, outside their
> >> $HOME, so their quota is NOT affected by the size of the wordlist. We
> >> are concerned about the size of all wordlists combined, so we plan on
> >> reshuffling and purging old entries periodically (more on this below).
> >>
> >> We run maildrop ( http://www.courier-mta.org/maildrop/ ) as our mail
> >> filter/mail delivery agent. Maildrop can read instructions from a
> >> file, which describe how to filter incoming mail (similar to procmail)
> >> and we plan on adding something like the following to the global
> >> /etc/maildroprc file:
> >>
> >> xfilter "/usr/local/bin/bogofilter -u -e -p"
> >> if (/^X-Bogosity: Spam, tests=bogofilter/)
> >> {
> >>    to "$HOME/Maildir/.spam"
> >> }
> >>
> >> Note that with the above recipe we are configuring bogofilter to
> >> update every user's wordlist automatically. Our logic for this is as
> >> follows:
> >>
> >> 1) We don't trust our users to keep checking their spam folder
> >> for false positives, therefore, if we do NOT update for them they most
> >> likely will ONLY train their wordlist with false negatives.
> >>
> >> 2) We plan on reaping entries older than X days (we are leaning
> >> towards X = 90/120 days) so that our initial wordlist each user
> >> started off with will eventually REALLY become their own wordlist.
> >> Again if we do NOT update for them, at some point in time they may
> >> never have any ham messages, thus making their wordlist useless.
> >>
> >> We would appreciate any feedback the list can provide. Thanks.
> >>
> >> _________________________
> >> Bryan Loniewski
> >> Rutgers University
> >> NBCS - Systems Programmer
> >
> > Hello Bryan,
> >
> > Sounds like you've given significant thought to this.  Good!
> 
> We are trying ;)
> 
> >
> > A couple of questions:
> >
> > What OS will squirrelmail be running on?  What database will bogofilter
> > be using?
> 
> We run squirrelmail on Solaris 9 boxes. We planned on using Berkeley
> DB (4.2.52).
> 
> >
> > In the past there have been some problems with NFS locking not working
> > properly in all environments.  This is relevant when using Berkeley DB,
> > but does not affect sqlite3.
> 
> So would you recommend we use sqlite3? Do we lose/gain anything by
> using sqlite3 vs. Berkeley?
> 
> >
> > I'm sure other folks with more detailed networking knowledge will
> > contribute their thoughts.
> >
> > HTH,
> >
> > David
> >