multi-user [was: New Release: Bogofilter 1.0.0]

Greg Louis glouis at dynamicro.on.ca
Wed Dec 28 13:09:03 CET 2005


On 20051227 (Tue) at 2317:45 -0500, David Relson wrote:
> On Tue, 27 Dec 2005 19:48:44 -0500
> Tom Allison wrote:

> > I guess what I'm not clear on is how you would run that many accounts on a mail 
> > server today.  The typical approach for procmail is to have it configured for 
> > each user (passwd) on the system.  But for 60,000 email accounts (or any large 
> > number) you don't want them all to have their own account but use virtual 
> > accounts.  Under that approach I'm not sure how you would configure 
> > procmail/bogofilter to play nice (even if you did use individual bogofilter 
> > wordlists).
> 
> Hi Tom,
> 
> I don't recall the details :-<
> 
> Google for "bogofilter +toronto +university" and you'll find relevant material.
> 
From a Usenix paper by Jeremy Blosser, timeframe September 2004:

        York University in Toronto recently deployed Bogofilter as a
classifier for their environment of 60,000 user accounts.  Incoming
mail volume is on the order of hundreds of thousands of messages per
day.  At the time of writing, this implementation was too new to have
reliable numbers available, but early results are promising.  In this
implementation messages pass through Bogofilter after DCC has already
scanned them and rejected spams it can detect; approximately 30-40% of
this remaining incoming mail is initially being flagged as spam by
Bogofilter.  This is an order of magnitude higher than the block rate
of the SpamAssassin implementation that Bogofilter replaced, with a
dramatically lower rate of false positives reported.

        A large ISP in Australia is using a modified version of
Bogofilter with a single wordlist to watch 150,000 mailboxes.  Over 1
million messages are processed per day.  Bogofilter is believed to be
around 95% effective in this environment, with no false positives
reported in 6 months of operation.  The wordlist management is
completely centralized, with no user input whatsoever.  Administrators
keep Bogofilter's training current by manually scanning and training on
random samplings of 100-300 "unsure" emails per week.

End quote.

From an August 2004 letter by York U's CNS Information Manager to
Jeremy:

[...] filtering for 60000 student accounts was activated yesterday. 
They are all using the same bogofilter database.  I should have some
stats next week.  We have just turned off spamassassin but we had it
active parallel to bogofilter until now.  We are not rejecting at
connection now, just marking and have a global procmail rule to dump
spam in a user folder which is periodically cleaned.  We will probably
start rejecting at connection once we have built up a bit more
confidence.

So far the user response has been overwhelmingly positive.  People are
saying "email is useful again!" and such.  They have also remarked how
they no longer have to look through their spam folder since there is
never any good mail in it - spamassassin regularly tagged some good
mail.

I have to say that getting to this point wasn't easy - and there is
still a ways to go [...] the fact that the solution survived is a
testament to its merits far more than my negotiating skill... it simply
works.

End quote.

I met with the York U team on Sept. 8, 2004.  They were still happy and
enthusiastic, and looking forward to their first major statistical
summaries.  Unfortunately, we didn't follow up on that, so I never saw
the actual performance figures.

Individual databases for a user community of that scale aren't a practical
idea; the unexpected and delightful observation was -- and we've all
seen it who've tried it -- that one database works extremely well even
for a widely disparate user population, and even with limited (though
careful) training.
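
For anyone wondering what the weekly "unsure" sampling in the Australian
setup might look like in practice, here is a minimal Python sketch.  It is
not taken from the paper: the directory layout (one message per file in a
single "unsure" directory), the function name and the default sample size
are all my assumptions, chosen only to illustrate the idea.

```python
import random
from pathlib import Path


def pick_unsure_sample(unsure_dir, sample_size=200, seed=None):
    """Return a random sample of message files from a directory of "unsure" mail.

    Hypothetical layout: every file directly under unsure_dir is one message.
    """
    # Sort first so that a fixed seed gives a reproducible sample.
    messages = sorted(p for p in Path(unsure_dir).iterdir() if p.is_file())
    rng = random.Random(seed)
    # Never ask for more messages than actually exist.
    k = min(sample_size, len(messages))
    return rng.sample(messages, k)
```

An administrator would then read each sampled message and register it with
the shared wordlist by hand, e.g. with "bogofilter -s" for spam or
"bogofilter -n" for good mail.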

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |


