Bogofilter Best Practices?

Matthias Andree matthias.andree at gmx.de
Tue Dec 8 09:45:35 CET 2009


On 08.12.2009, 02:00, Randy J. Ray <rjray at blackperl.com> wrote:

> That's also the problem we're running into: the number of messages we
> have to train against has gotten so large that it is taking a *really*
> long time to generate the word-lists. Since the messages are in the
> RDBMS, and not mbox files, we end up doing shell-execution of
> bogofilter once per message to be trained (don't blame me, this was
> written before I got here!). To be plain, it's taking forever to
> generate the files. It's a mess, and I have the dubious honor of
> trying to improve the process.

Sorry for the awkward quoting (reported against Opera 9.X and still  
unfixed as of 10.10).

Could you dump a set of messages either to a Maildir or MH format
directory, or to an mbox, and then run batch training on the directory?
That would let you train on the whole folder with a single bogofilter
execution; bogofilter's -M and perhaps -b options could help here.
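
For example (untested, and the corpus paths here are just placeholders):

  # one bogofilter run over a whole spam mbox:
  # -s = register as spam, -M = input is in mbox format
  bogofilter -s -M -I /var/tmp/corpus/spam.mbox

  # or, for a Maildir/MH-style one-message-per-file directory, feed the
  # file names to a single process via -b (bulk mode, names on stdin):
  find /var/tmp/corpus/spam/cur -type f | bogofilter -s -b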

> I've been trying to look into ways to reduce the number of times we
> execute the bogofilter binary itself. It seems that I could emulate a
> sort of "server" with the -b option, writing each message to a temp
> file then feeding the file name to STDIN. But that appears to be
> geared towards message classification, not spam/ham training. I'm
> looking at the possibility of creating fake mbox files from the
> messages, but I have next to no header information for the messages,
> just the bodies. I'm not sure how well that would work.

You can tell bogofilter to treat everything as the body (-H option).
Other than that, it shouldn't matter whether you use it for
classification or for training: combining -b with -s or -n is possible.
The difference is only in how bogofilter deals with the database and
what it prints as output, i.e. whether it just reads the database
(without -s/-n/-u) or updates it.
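
Something along these lines should work for registering your body-only
messages, one process per class (untested; the directory names are made
up):

  # register plain message bodies as spam: -b reads file names from
  # stdin, -s registers as spam, -H leaves the missing headers untagged
  find /var/tmp/bodies/spam -type f | bogofilter -s -b -H

  # and the same for ham, with -n
  find /var/tmp/bodies/ham -type f | bogofilter -n -b -H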

> If anyone has any experiences similar to this, or any thoughts or
> ideas to share, I'd be happy to hear them. I've been approaching this
> with an eye towards encapsulating some of the functionality into a
> Perl module. It seems that the classifier-(pseudo)-daemon would be
> easy-enough to do, but that wouldn't help my problem. We get
> good-enough performance and throughput on the actual classification of
> incoming messages. It's the creation of the word-list files from our
> (growing) corpus that is driving me nuts.

The alternative: if you have a reference master for your message RDBMS,
you could consider updating a reference wordlist.db whenever the RDBMS
gets updated (provided you can change your RDBMS frontends like that),
and then sharing that wordlist.db with the other hosts, either directly
if that works for your setup (i.e. matching database library versions
and possibly endianness/word width), or by dumping and reloading it
with bogoutil.
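
The dump/reload route would look roughly like this (untested; adjust
the paths for your installation):

  # on the master: dump the reference wordlist to portable text
  bogoutil -d ~/.bogofilter/wordlist.db > wordlist.txt

  # on each consumer: rebuild the database from the text dump, which is
  # safe across Berkeley DB versions, endianness and word width
  bogoutil -l wordlist.db.new < wordlist.txt
  mv wordlist.db.new ~/.bogofilter/wordlist.db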

-- 
Matthias Andree


