Bogofilter Best Practices?

Thomas Anderson tanderson at orderamidchaos.com
Tue Dec 15 23:06:00 CET 2009


Take a look at bogofilter-milter.pl as an example of a daemon.
http://stuff.mit.edu/~jik/software/bogofilter-milter/bogofilter-milter.pl.txt
http://orderamidchaos.com/bogofilter/bogofilter-milter (my custom version)

I think you can actually classify better on the headers alone than on 
the body alone, at least at the margin (i.e. those emails which don't 
classify easily either way).  I would try to capture at least the 
Received: lines in your database.  Much can be determined from them, 
e.g. see: http://orderamidchaos.com/bogofilter/spamitarium
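
As a rough illustration of capturing the Received: lines before the rest of the header is thrown away, here is a minimal Python sketch using only the standard library. The function name and the sample message are made up for the example; it is not how spamitarium does it.

```python
from email import message_from_string


def received_lines(raw_message):
    """Return the Received: header values in the order they appear
    (newest hop first), with folded continuation lines collapsed."""
    msg = message_from_string(raw_message)
    return [" ".join(value.split()) for value in msg.get_all("Received", [])]


# Hypothetical sample message using documentation-only addresses.
sample = (
    "Received: from mx.example.net (mx.example.net [192.0.2.10])\n"
    "\tby mail.example.org with ESMTP; Tue, 15 Dec 2009 23:05:58 +0100\n"
    "Received: from [198.51.100.7] by mx.example.net; Tue, 15 Dec 2009 23:05:40 +0100\n"
    "Subject: test\n"
    "\n"
    "body text\n"
)

for hop in received_lines(sample):
    print(hop)
```

Storing these strings next to the body would let you prepend them again when you build the training input.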

Before getting too in-depth into your bogofilter config, I would first 
make sure you're weeding out as much noise as possible.  Take advantage 
of your SMTP server's features, including message size limits, 
multiple-connection throttling, recipient limits, timeouts, greet pause, 
reverse DNS (PTR) lookups, and DNSBLs/RHSBLs.

# reject mail from IPs listed in DNSBLs
FEATURE(`dnsbl',`http.dnsbl.sorbs.net',`"554 Rejected. " $&{client_addr} " found in http.dnsbl.sorbs.net. Please correct your open proxy issue, and/or contact addressee through other means."')dnl
FEATURE(`dnsbl',`socks.dnsbl.sorbs.net',`"554 Rejected. " $&{client_addr} " found in socks.dnsbl.sorbs.net. Please correct your open proxy issue, and/or contact addressee through other means."')dnl
FEATURE(`dnsbl',`smtp.dnsbl.sorbs.net',`"554 Rejected. " $&{client_addr} " found in smtp.dnsbl.sorbs.net. Please correct your open proxy issue, and/or contact addressee through other means."')dnl
FEATURE(`dnsbl',`relays.visi.com',`"554 Rejected. " $&{client_addr} " found in relays.visi.com. Please correct your open relay problem, and/or contact addressee through other means."')dnl
FEATURE(`dnsbl',`sbl-xbl.spamhaus.org',`"554 Rejected. " $&{client_addr} " found in sbl-xbl.spamhaus.org. Please correct your Spamhaus designation as a spammer, and/or contact addressee through other means. See http://www.abuse.net/sbl.phtml?IP=" $&{client_addr} " for more information"')dnl
FEATURE(`dnsbl',`blackholes.easynet.nl', `"550 Mail from " $`'&{client_addr} " refused - see http://abuse.easynet.nl/blackholes.html"')dnl
FEATURE(`dnsbl',`blackholes.mail-abuse.org', `"550 Mail from " $`'&{client_addr} " refused - see http://mail-abuse.org/"')dnl

# reject mail from sender domains listed in RHSBLs
FEATURE(`rhsbl',`dsn.rfc-ignorant.org',`"550 Mail from domain " $`'&{RHS} " refused. MX of domain do not accept bounces. This violates RFC 821/2505/2821 - see http://www.rfc-ignorant.org/"')dnl
FEATURE(`rhsbl',`bogusmx.rfc-ignorant.org',`"550 Mail from domain " $`'&{RHS} " refused. An MX for your domain is bogus - see http://www.rfc-ignorant.org/"')dnl
FEATURE(`rhsbl',`whois.rfc-ignorant.org',`"550 Mail from domain " $`'&{RHS} " refused. The WHOIS information is missing, incomplete, or incorrect - see http://www.rfc-ignorant.org/"')dnl
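
For anyone unsure what these lookups amount to: a DNSBL is queried by reversing the octets of the connecting IP and appending the list's zone, while an RHSBL is queried with the sender's domain in front of the zone. An answer (conventionally an A record in 127.0.0.0/8) means "listed". The Python sketch below only builds the query names and does no DNS traffic; the function names are invented for the illustration.

```python
import ipaddress


def dnsbl_query_name(client_ip, zone):
    """Build the DNS name to look up for an IPv4 client in a DNSBL zone."""
    # IPv4Address validates the input and raises ValueError on garbage.
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def rhsbl_query_name(sender_address, zone):
    """Build the DNS name to look up for a sender's domain in an RHSBL zone."""
    domain = sender_address.rsplit("@", 1)[-1].strip("<>").lower()
    return domain + "." + zone


print(dnsbl_query_name("192.0.2.99", "sbl-xbl.spamhaus.org"))
print(rhsbl_query_name("<someone@Example.COM>", "dsn.rfc-ignorant.org"))
```

The same names can be fed to dig or host by hand to check why a particular connection was rejected.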

I think an interesting solution to your problem might be to set up one 
of your servers to run a training daemon and have the others send their 
training data to it (e.g. see http://orderamidchaos.com/bogofilter/bfproxy).  
Then you can have a single central wordlist which is shared for 
classification purposes (perhaps even on network storage).
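
Whichever host ends up doing the training, it helps if it can register a whole batch in one bogofilter run rather than forking one process per message. A minimal Python sketch of the "fake mbox" idea, assuming the bodies have already been fetched from the database as strings; the function name, the placeholder sender, and the fixed timestamp are all invented for the example.

```python
import time


def bodies_to_mbox(bodies, sender="MAILER-DAEMON"):
    """Wrap header-less message bodies in minimal mbox framing.

    Each message gets a "From " separator line and an empty header
    block.  Body lines that begin with "From " are escaped with ">"
    so they are not mistaken for message separators.
    """
    stamp = time.asctime(time.gmtime(0))  # placeholder date, not the real one
    chunks = []
    for body in bodies:
        lines = body.replace("\r\n", "\n").rstrip("\n").split("\n")
        escaped = [">" + ln if ln.startswith("From ") else ln for ln in lines]
        chunks.append("From %s %s\n\n%s\n\n" % (sender, stamp, "\n".join(escaped)))
    return "".join(chunks)


mbox_text = bodies_to_mbox(["hello\nFrom me to you", "second message"])
print(mbox_text)
```

The result could then be written to a file and handed to a single bogofilter run in mbox mode. I believe that is `-M` combined with `-s` or `-n`, but check bogofilter(1) for your version before relying on it. If you kept any header lines, they would go in place of the empty header block.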

Tom


Randy J. Ray wrote:
> Please bear with me through this, as it might ramble a bit here and there. I'm
> fairly new to using bogofilter, and the context in which I'm managing it is
> probably a little different than most users...
> 
> My company is using bogofilter as part of our fraud detection/interception
> strategy. I don't know how much detail I can go into, but I can say this:
> 
> * We manage emails between buyers and sellers on a classifieds-oriented site
> * We store email messages in an RDBMS, minimal header information but we store
>   complete bodies (HTML is stripped out and non-HTML MIME parts are removed)
> * Each message is evaluated with bogofilter, those with a high-enough or
>   low-enough score to be unambiguously spam/ham are marked as such and handled
> * Those that are in the mid-band get reviewed by people we contract out to,
>   and hand-classified as one or the other
> * Nightly, the known-ham and known-spam are used to create the word-lists,
>   which are distributed across our servers
> 
> Because the machines that handle the incoming mail are distributed, we don't do
> automatic classification; that is, we don't update the word-lists during
> classification because the updates wouldn't get shared. That's why we do it
> nightly via jobs that run out of cron.
> 
> That's also the problem we're running into: the number of messages we have to
> train against has gotten so large that it is taking a *really* long time to
> generate the word-lists. Since the messages are in the RDBMS, and not mbox
> files, we end up doing shell-execution of bogofilter once per message to be
> trained (don't blame me, this was written before I got here!). To be plain,
> it's taking forever to generate the files. It's a mess, and I have the dubious
> honor of trying to improve the process.
> 
> I've been trying to look into ways to reduce the number of times we execute the
> bogofilter binary itself. It seems that I could emulate a sort of "server" with
> the -b option, writing each message to a temp file then feeding the file name
> to STDIN. But that appears to be geared towards message classification, not
> spam/ham training. I'm looking at the possibility of creating fake mbox files
> from the messages, but I have next to no header information for the messages,
> just the bodies. I'm not sure how well that would work.
> 
> If anyone has any experiences similar to this, or any thoughts or ideas to
> share, I'd be happy to hear them. I've been approaching this with an eye
> towards encapsulating some of the functionality into a Perl module. It seems
> that the classifier-(pseudo)-daemon would be easy-enough to do, but that
> wouldn't help my problem. We get good-enough performance and throughput on the
> actual classification of incoming messages. It's the creation of the word-list
> files from our (growing) corpus that is driving me nuts.
> 
> Randy



