Site wide vs Personal

Greg Louis glouis at dynamicro.on.ca
Mon Jan 13 13:29:20 CET 2003


On 20030112 (Sun) at 2317:12 -0500, Tom Allison wrote:
> A while back I saw some threads on site-wide use versus personal use.
> 
> Was there a consensus on how effective bogofilter remains in a site-wide 
> implementation where all the users share the same basic good/bad lists?

Been doing that for a while now.  I don't know if enough of us do it to
produce a consensus, but I can tell you what my experience has been. 

Currently my own single-user bogofilter is delivering around 2.5% false
negatives and I've had one false positive in the last 1200 or so
nonspams.  At work, where I've implemented bogofilter for about 80
users with just one training db, we've got the false positives down
below measurable as well -- zero in the last 2500 nonspams -- but we're
delivering about 12% of spams.  The problem is that we've got quite a
few users who take newsletter emails that look very spammy, so we have
to be a bit more stringent in classifying spam in order to avoid
quarantining legitimate email.  This lets some real spams get through.

This brings me to a very significant point: how well bogofilter
(Bayesian filtering in general) can do in a multiuser environment
depends inversely on the heterogeneity of user interests.  It's much
harder serving engineers _and_ marketing people _and_ purchasing agents
out of one training db than it would likely be if I ran a bogofilter
for each department.  Fortunately, my users are (generally) happy with
the 88% reduction in the number of spams that hit them.  (The VP
Marketing, on the other hand, opted out of the spam filter; he's
uncomfortable with the whole concept ;)

So how well bogofilter does will vary a lot depending on who's in the
user base, how carefully the spam cutoff, nonspam cutoff, minimum
deviation, s and x parameters are chosen, and how carefully the
training is done (auto-training with -u is a recipe for rapid disaster
when there
are lots of users).  I review all the email personally (classify it
into good / bad / uncertain with bogofilter and then correct the
classification manually; 99.9% of the time it suffices to glance at the
subject line) before deciding what to use for training.  I have to do
that, because I get no help from most of the users; I have a mailbox
for spam collection, but only two or three of the users bounce the
spams that get through to them, and only I know what's uncertain.  I
train my db on errors and uncertains, and have been doing so ever since
I got to about 10,000 spams and 10,000 nonspams in the training corpus.
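For reference, the parameters mentioned above live in bogofilter's
config file.  The fragment below is purely illustrative -- these are
not my actual values, just a sketch of the knobs involved, with names
as I understand bogofilter's configuration format:

```
# bogofilter.cf -- illustrative values only, not a recommendation
spam_cutoff = 0.99   # be stringent: only very spammy mail is quarantined
ham_cutoff  = 0.10   # below this, mail is confidently nonspam
min_dev     = 0.10   # ignore tokens scoring within 0.1 of the neutral 0.5
robs        = 0.01   # Robinson's s: weight given to the prior robx
robx        = 0.52   # Robinson's x: score assumed for unseen tokens
# Setting ham_cutoff enables the ternary (spam / nonspam / unsure)
# classification used for the review-then-train workflow.
```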

Oh, did I mention I use Robinson-Fisher exclusively?  It's a lot easier
to detect uncertainty with that approach than with Robinson-GM, and
both of them discriminate much more effectively (in my environment,
anyway) than the original Graham calculation.  Classifying in ternary
mode for training, as described above, is a very effective way to train
while minimizing time spent on manual reclassification.
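For the curious, the Robinson-Fisher combining step can be sketched
roughly as follows.  This is a minimal Python rendering of the
published inverse chi-square algorithm, not bogofilter's actual code,
which differs in detail:

```python
import math

def chi2Q(x2, v):
    """Survival function of the chi-square distribution for even
    degrees of freedom v, via the standard series expansion."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_indicator(probs):
    """Combine per-token spam probabilities into one indicator:
    near 1.0 => spam, near 0.0 => nonspam, near 0.5 => unsure,
    which is what makes uncertainty easy to detect."""
    n = len(probs)
    S = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    H = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + S - H) / 2.0
```

A message whose tokens all score 0.5 comes out at exactly 0.5
(maximally unsure), while uniformly spammy tokens push the indicator
toward 1.0; middling evidence stays in the middle instead of being
forced to one extreme, which is why this scheme lends itself to
ternary classification.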

What a long-winded way to say YMMV :)

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |



