best practices question

Jeremy Blosser jblosser-bogofilter at firinn.org
Sat Sep 21 01:49:44 CEST 2002


On Sep 20, David Relson [relson at osagesoftware.com] wrote:
> Until I saw your posting about stress testing, the thought of adaptive use 
> hadn't occurred to me.

Most of our spam blocking currently happens via Vipul's Razor.  Once we've
tested bogofilter to our satisfaction, our implementation plan looks like:

- Continue to run everything through Vipul's, and use its opinion of a mail
  to train bogofilter:

  if (razor->is_spam(msg))
     bogofilter -s
     drop msg
  else
     bogofilter -h
     deliver msg

- Once this has created a bogofilter db of sufficient size, ask bogofilter
  its opinion as well, and if the two don't agree, store the headers:

  bf = bogofilter->is_spam(msg)
  if (r = razor->is_spam(msg))
     bogofilter -s
     drop msg
  else
     bogofilter -h
     deliver msg
  if (r != bf)
     store_headers()

  we'll then look at a sample of these and hopefully find that the
  difference is because bogofilter is a lot smarter than Vipul's (a rough
  sketch of this stage follows the list).

- Assuming that is indeed what we find, we'll switch to letting bogofilter
  drive:

  if (bogofilter->is_spam(msg))
     bogofilter -s
     drop msg
  else
     bogofilter -h
     deliver msg

  and tell our users to let us know of any misclassifications so we can
  retrain (a small retraining sketch also follows the list).  That'll be
  the hard part, and the one that'll only work if these algorithms really
  are good enough to keep the misclassifications to a low number.
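
To make the above concrete, here is a rough sketch of the stage-two filter
in Python.  It leans on a few assumptions that aren't spelled out in the
plan: razor-check and bogofilter are on the PATH, razor-check exits 0 when
a message is known spam, a plain bogofilter run exits 0 for spam and 1 for
non-spam, non-spam is registered with -n (current bogofilter's flag for
what the pseudocode calls -h), and the disagreement log and delivery step
are placeholders for whatever the local MDA setup actually does:

  #!/usr/bin/env python3
  """Stage two, sketched: Razor still drives delivery, bogofilter is
  trained from Razor's verdict, and disagreements are logged for review."""

  import subprocess
  import sys

  DISAGREEMENT_LOG = "/var/log/bogofilter-disagreements"  # placeholder path

  def run(cmd, msg):
      """Feed the raw message to a command and return its exit status."""
      proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
      proc.communicate(msg)
      return proc.returncode

  def razor_says_spam(msg):
      # Assumption: razor-check exits 0 when the message is listed as spam.
      return run(["razor-check"], msg) == 0

  def bogofilter_says_spam(msg):
      # bogofilter with no registration flag just classifies:
      # exit 0 = spam, exit 1 = non-spam.
      return run(["bogofilter"], msg) == 0

  def store_headers(msg):
      # Keep only the headers of messages the two filters disagree on.
      headers = msg.split(b"\n\n", 1)[0]
      with open(DISAGREEMENT_LOG, "ab") as log:
          log.write(headers + b"\n\n")

  def deliver(msg):
      # Placeholder delivery: hand the message back on stdout so the
      # surrounding MDA (procmail, maildrop, ...) can finish the job.
      sys.stdout.buffer.write(msg)

  def main():
      msg = sys.stdin.buffer.read()

      bf = bogofilter_says_spam(msg)      # bogofilter's opinion, recorded only
      r = razor_says_spam(msg)            # Razor's opinion still drives

      if r:
          run(["bogofilter", "-s"], msg)  # register as spam
          # drop msg: simply don't deliver it
      else:
          run(["bogofilter", "-n"], msg)  # register as ham (-n here, not -h)
          deliver(msg)

      if r != bf:
          store_headers(msg)

  if __name__ == "__main__":
      main()

Stage one is the same loop minus the bogofilter query and the comparison,
and stage three just swaps which of the two verdicts decides delivery.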
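
And a minimal retraining sketch for the stage-three correction path,
assuming a user-reported misclassified message has been saved to a file.
If the message had already been registered the wrong way, current
bogofilter versions also offer -Ns / -Sn to undo the old registration
while making the new one:

  #!/usr/bin/env python3
  """Re-register a user-reported misclassification with bogofilter."""

  import subprocess
  import sys

  def retrain(path, is_spam):
      # -s registers spam, -n registers non-spam; the saved message is
      # fed to bogofilter on stdin.
      flag = "-s" if is_spam else "-n"
      with open(path, "rb") as fh:
          subprocess.run(["bogofilter", flag], stdin=fh, check=True)

  if __name__ == "__main__":
      # Usage: retrain.py spam|ham /path/to/saved/message
      retrain(sys.argv[2], sys.argv[1] == "spam")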

Obviously the original paper on this theory spoke of training it per user,
but that's just not an option in an org like ours, where the users are
telling IT "we're paying you to deal with this spam so we don't have to".
Hopefully it'll work in this environment as well.  Results so far are
positive; our spam is pretty heterogeneous, and so is our legit mail.



