bogofilter setup for 50000+ users

dhottinger at harrisonburg.k12.va.us
Sat Jul 21 00:14:34 CEST 2007


Quoting dave-bogofilter at homer.cymry.org:

> Hi Bryan
>
> Always nice to hear about large bogofilter installs.
>
> Although this probably sounds like crazy talk, you might consider
> trying a single wordlist for your entire user population.  We did this
> for a population of around 2500 users with a very early version of bf
> with mindblowing results. See our implementation paper (including
> training scripts we used etc..) here:
> http://www.usenix.org/events/lisa04/tech/blosser.html
> There's anecdotal evidence that this has scaled well to environments of
> your size.
>
> I suggest this simply because bogofilter's decision making prowess in my
> experience improves linearly with the amount of data it's given. A
> single wordlist per user is not a lot of data probably, but a 50,000
> user wordlist is probably as near to perfect as you can get imo.
>
> Of course that complicates the training arrangement you have in mind.
>
> --dave josephsen.
>
> On Fri, Jul 20, 2007 at 04:02:37PM -0400, Bryan Loniewski wrote:
>> On Thu, 19 Jul 2007, David Relson wrote:
>>
>> > On Thu, 19 Jul 2007 14:00:15 -0400 (EDT)
>> > Bryan Loniewski wrote:
>> >
>> >> We are looking into running bogofilter for 50,000+ users and would
>> >> like to know of any issues with how we are planning on implementing
>> >> it.
>> >>
>> >> The interface most users will be using to train their wordlist is a
>> >> webmail client (squirrelmail) that will provide a "spam" and "not
>> >> spam" button depending on which folder the user is viewing.
>> >>
>> >> We plan on building each user's initial wordlist using messages from
>> >> various sources. For the ham side we are going to use official email
>> >> our users receive from various departments within the school (ie,
>> >> president's office, human resources, etc..) and several different
>> >> mailing-lists (ie, unix-admin, pc_lan_admin, etc..). The spam side
>> >> will come from a spam folder I've maintained for over a year now that
>> >> contains ~18000 messages. Note that we only plan on training with
>> >> the recommended minimum of 2000 ham, 2000 spam. After this is done we
>> >> are going to run bogotune and potentially use the recommended
>> >> settings.  Once complete we will test our wordlist and settings with
>> >> several of our own accounts and see how well we did.
>> >>
>> >> We plan on each user's wordlist being housed on NFS, outside their
>> >> $HOME, so their quota is NOT affected by the size of the wordlist. We
>> >> are concerned about the size of all wordlists combined, so we plan on
>> >> reshuffling and purging old entries periodically (more on this below).
>> >>
>> >> We run maildrop ( http://www.courier-mta.org/maildrop/ ) as our mail
>> >> filter/mail delivery agent. Maildrop can read instructions from a
>> >> file, which describe how to filter incoming mail (similar to procmail)
>> >> and we plan on adding something like the following to the global
>> >> /etc/maildroprc file:
>> >>
>> >> xfilter "/usr/local/bin/bogofilter -u -e -p"
>> >> if (/^X-Bogosity: Spam, tests=bogofilter/)
>> >> {
>> >>    to "$HOME/Maildir/.spam"
>> >> }
>> >>
>> >> Note that with the above recipe we are configuring bogofilter to
>> >> update every user's wordlist automatically. Our logic for this is as
>> >> follows:
>> >>
>> >> 1) We don't trust our users to keep checking their spam folder
>> >> for false positives, therefore, if we do NOT update for them they most
>> >> likely will ONLY train their wordlist with false negatives.
>> >>
>> >> 2) We plan on reaping entries older than X days (we are leaning
>> >> towards X = 90/120 days) so that our initial wordlist each user
>> >> started off with will eventually REALLY become their own wordlist.
>> >> Again if we do NOT update for them, at some point in time they may
>> >> never have any ham messages, thus making their wordlist useless.
>> >>
>> >> We would appreciate any feedback the list can provide. Thanks.
>> >>
>> >> _________________________
>> >> Bryan Loniewski
>> >> Rutgers University
>> >> NBCS - Systems Programmer
>> >
>> > Hello Bryan,
>> >
>> > Sounds like you've given significant thought to this.  Good!
>>
>> We are trying ;)
>>
>> >
>> > A couple of questions:
>> >
>> > What OS will squirrelmail be running on?  What database will bogofilter
>> > be using?
>>
>> We run squirrelmail on Solaris 9 boxes. We planned on using Berkeley
>> DB (4.2.52).
>>
>> >
>> > In the past there have been some problems with NFS locking not working
>> > properly in all environments.  This is relevant when using Berkeley DB,
>> > but does not affect sqlite3.
>>
>> So would you recommend we use sqlite3? Do we lose/gain anything by
>> using sqlite3 vs. Berkeley?
>>
>> >
>> > I'm sure other folks with more detailed networking knowledge will
>> > contribute their thoughts.
>> >
>> > HTH,
>> >
>> > David
>> >
>> _______________________________________________
>> Bogofilter mailing list
>> Bogofilter at bogofilter.org
>> http://www.bogofilter.org/mailman/listinfo/bogofilter
>

I've been following this thread pretty closely.  I use bogofilter with a
single wordlist for 750+ users and have been thinking of moving to a
wordlist for each user.  We use Horde webmail with "report as spam" and
"report as innocent" links.  Reported mail goes to a mailbox on the
mailserver, and I look through it weekly (or more often) to make sure
that none of our administrators, or me ;-], has been reported as spam.
Then I import these emails into the wordlist.  This works well.  All
emails flagged as spam go into a spam mailbox that gets compressed and
rotated every week.  If a user was expecting an important work-related
email, I can pull it out and forward it to the user.  These false
positives are then fed back into bogofilter as good mail.  My experience
has been that users more often than not report non-spam as spam, which
is why I don't let each user manage their own spam.  However, this
becomes cumbersome when I have to determine what is really spam, or find
emails that have been wrongly flagged as spam.  I am most interested to
see which direction you go with 50,000 users.
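
For anyone wanting to script the review-then-train step described above,
here is a minimal sketch in Python.  It is not the script used here; the
function names and the "kind" labels are made up for illustration.  The
flags are the ones documented in the bogofilter man page: -s registers a
message as spam, -n registers it as ham, -Sn and -Ns move a previously
registered message to the other side, and -d points at a wordlist
directory.  The -Sn/-Ns forms assume the message was already registered,
for example by running the filter with -u.

```python
import subprocess

# Hypothetical labels for a reviewed message, mapped to bogofilter flags.
TRAINING_FLAGS = {
    "spam": "-s",             # register as spam
    "ham": "-n",              # register as ham (good mail)
    "false_positive": "-Sn",  # undo spam registration, register as ham
    "false_negative": "-Ns",  # undo ham registration, register as spam
}


def training_command(kind, wordlist_dir=None):
    """Build the bogofilter argument list for one reviewed message."""
    if kind not in TRAINING_FLAGS:
        raise ValueError("unknown message kind: %r" % (kind,))
    cmd = ["bogofilter", TRAINING_FLAGS[kind]]
    if wordlist_dir is not None:
        # -d selects the directory holding the wordlist to update.
        cmd += ["-d", str(wordlist_dir)]
    return cmd


def train(message_path, kind, wordlist_dir=None):
    """Feed one message file to bogofilter on stdin (needs bogofilter installed)."""
    with open(message_path, "rb") as msg:
        subprocess.run(training_command(kind, wordlist_dir), stdin=msg, check=True)
```

With a single shared wordlist the wordlist_dir argument stays fixed; with
per-user wordlists it would be the only thing that changes from user to
user, so the manual review step could stay the same either way.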

-- 
Dwayne Hottinger
Network Administrator
Harrisonburg City Public Schools



