Bogofilter Best Practices?

Tue Dec 8 02:00:05 CET 2009

Please bear with me through this, as it might ramble a bit here and there. I'm
fairly new to using bogofilter, and the context in which I'm managing it is
probably a little different from that of most users.

My company is using bogofilter as part of our fraud detection/interception
strategy. I don't know how much detail I can go into, but I can say this:

* We manage emails between buyers and sellers on a classifieds-oriented site
* We store email messages in an RDBMS, with minimal header information but
  complete bodies (HTML is stripped out and non-HTML MIME parts are removed)
* Each message is evaluated with bogofilter; those with a high-enough or
  low-enough score to be unambiguously spam/ham are marked as such and handled
* Those that are in the mid-band get reviewed by people we contract out to,
  and hand-classified as one or the other
* Nightly, the known-ham and known-spam are used to create the word-lists,
  which are distributed across our servers
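
For illustration only, the three-way routing described in the list above could
be sketched as follows. The cutoff values and the names `SPAM_CUTOFF`,
`HAM_CUTOFF`, and `triage` are hypothetical; the post does not give the actual
thresholds. The only assumption about bogofilter is that it reports a
spamicity score between 0.0 and 1.0.

```python
# Hypothetical cutoffs: the real thresholds are not stated in the post.
SPAM_CUTOFF = 0.99
HAM_CUTOFF = 0.01


def triage(score, spam_cutoff=SPAM_CUTOFF, ham_cutoff=HAM_CUTOFF):
    """Route a message by its spamicity score (0.0 to 1.0).

    Returns "spam" or "ham" for unambiguous scores, and "review" for
    mid-band scores that would go to the human reviewers.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "review"
```

Only the "review" results would reach the contracted reviewers; their verdicts
and the unambiguous ones together make up the nightly training corpus.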

Because the machines that handle the incoming mail are distributed, we don't do
automatic classification; that is, we don't update the word-lists during
classification because the updates wouldn't get shared. That's why we do it
nightly via jobs that run out of cron.

That's also the problem we're running into: the number of messages we have to
train against has gotten so large that it is taking a *really* long time to
generate the word-lists. Since the messages are in the RDBMS, and not mbox
files, we end up doing shell-execution of bogofilter once per message to be
trained (don't blame me, this was written before I got here!). To be plain,
it's taking forever to generate the files. It's a mess, and I have the dubious
honor of trying to improve the process.

I've been trying to look into ways to reduce the number of times we execute the
bogofilter binary itself. It seems that I could emulate a sort of "server" with
the -b option, writing each message to a temp file then feeding the file name
to STDIN. But that appears to be geared towards message classification, not
spam/ham training. I'm looking at the possibility of creating fake mbox files
from the messages, but I have next to no header information for the messages,
just the bodies. I'm not sure how well that would work.
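
As a rough sketch of the "fake mbox" idea, the bodies pulled from the database
could be written into a single mbox file using Python's standard `mailbox`
module. The function name `write_mbox`, the placeholder `From` and `Subject`
headers, and the file name are all made up for illustration. Whether
bogofilter tokenizes such header-poor messages well is the open question
raised above, and this sketch does not answer it.

```python
import mailbox
from email.message import EmailMessage


def write_mbox(path, bodies):
    """Write plain-text message bodies to an mbox file at `path`.

    Each body is wrapped in a minimal message. The header values are
    placeholders (an assumption), identical for every message.
    """
    box = mailbox.mbox(path, create=True)
    try:
        for body in bodies:
            msg = EmailMessage()
            msg["From"] = "unknown@example.invalid"  # placeholder sender
            msg["Subject"] = "no subject"            # placeholder subject
            msg.set_content(body)
            # add() writes the "From " separator line and escapes any
            # body line that itself starts with "From ".
            box.add(msg)
        box.flush()
    finally:
        box.close()
    return path
```

One file of known spam and one of known ham could then be handed to bogofilter
in one invocation each, rather than one invocation per message. The man page
documents the `-s`/`-n` registration flags and an mbox mode (`-M`); the exact
combination should be checked against the installed version.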

If anyone has any experiences similar to this, or any thoughts or ideas to
share, I'd be happy to hear them. I've been approaching this with an eye
towards encapsulating some of the functionality into a Perl module. It seems
that the classifier-(pseudo)-daemon would be easy enough to do, but that
wouldn't help my problem. We get good-enough performance and throughput on the
actual classification of incoming messages. It's the creation of the word-list
files from our (growing) corpus that is driving me nuts.

Randy
-- 
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Randy J. Ray      Sunnyvale, CA      http://www.rjray.org   rjray at blackperl.com

Silicon Valley Scale Modelers: http://www.svsm.org