Bogofilter Best Practices?
Thomas Anderson
tanderson at orderamidchaos.com
Tue Dec 15 23:06:00 CET 2009
Take a look at bogofilter-milter.pl as an example of a daemon.
http://stuff.mit.edu/~jik/software/bogofilter-milter/bogofilter-milter.pl.txt
http://orderamidchaos.com/bogofilter/bogofilter-milter (my custom version)
I think you can actually classify better on the headers alone than on
the body alone, at least at the margin (i.e. for those emails which
don't classify easily either way). I would try to capture at least the
Received lines in your database. Much can be determined from them...
e.g. see: http://orderamidchaos.com/bogofilter/spamitarium
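
As a rough illustration of capturing the Received lines before a message
goes into the database, here is a minimal sketch in Python (my choice for
illustration only; it is not taken from spamitarium). The function name
and the sample message are made up for the example.

```python
from email import message_from_string


def received_lines(raw_message):
    """Return each Received header as one unfolded line, newest hop first."""
    msg = message_from_string(raw_message)
    # get_all() keeps the order in which the headers appear in the message;
    # MTAs prepend their Received header, so the first entry is the last hop.
    values = msg.get_all("Received", [])
    # Folded headers contain newlines and tabs; collapse them to single spaces.
    return [" ".join(str(value).split()) for value in values]


# Hypothetical sample message using documentation-only addresses.
SAMPLE = (
    "Received: from mail.example.net (mail.example.net [192.0.2.10])\n"
    "\tby mx.example.org with ESMTP id abc123;\n"
    "\tTue, 15 Dec 2009 23:05:58 +0100\n"
    "Received: from [198.51.100.7] (helo=localhost)\n"
    "\tby mail.example.net with SMTP; Tue, 15 Dec 2009 23:05:50 +0100\n"
    "From: buyer@example.com\n"
    "Subject: test\n"
    "\n"
    "body text\n"
)

if __name__ == "__main__":
    for line in received_lines(SAMPLE):
        print(line)
```

Storing these lines next to the body would let you hand bogofilter the
headers together with the body when you train or classify.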
Before getting too in-depth into your bogofilter config, I would first
make sure you're weeding out as much noise as possible. Take advantage
of your SMTP features, including message size limits,
multiple-connection throttling, recipient limits, timeouts, greet pause,
reverse PTR lookup, and DNSBLs/RHSBLs.
# reject mail from IPs listed in DNSBLs
FEATURE(`dnsbl',`http.dnsbl.sorbs.net',`"554 Rejected. " $&{client_addr}
" found in http.dnsbl.sorbs.net. Please correct your open proxy issue,
and/or contact addressee through other means."')dnl
FEATURE(`dnsbl',`socks.dnsbl.sorbs.net',`"554 Rejected. "
$&{client_addr} " found in socks.dnsbl.sorbs.net. Please correct your
open proxy issue, and/or contact addressee through other means."')dnl
FEATURE(`dnsbl',`smtp.dnsbl.sorbs.net',`"554 Rejected. " $&{client_addr}
" found in smtp.dnsbl.sorbs.net. Please correct your open proxy issue,
and/or contact addressee through other means."')dnl
FEATURE(`dnsbl',`relays.visi.com',`"554 Rejected. " $&{client_addr} "
found in relays.visi.com. Please correct your open relay problem, and/or
contact addressee through other means."')dnl
FEATURE(`dnsbl',`sbl-xbl.spamhaus.org',`"554 Rejected. " $&{client_addr}
" found in sbl-xbl.spamhaus.org. Please correct your Spamhaus
designation as a spammer, and/or contact addressee through other means.
See http://www.abuse.net/sbl.phtml?IP=" $&{client_addr} " for more
information"')dnl
FEATURE(`dnsbl',`blackholes.easynet.nl', `"550 Mail from "
$`'&{client_addr} " refused - see
http://abuse.easynet.nl/blackholes.html"')dnl
FEATURE(`dnsbl',`blackholes.mail-abuse.org', `"550 Mail from "
$`'&{client_addr} " refused - see http://mail-abuse.org/"')dnl
# reject mail from IPs listed in RHSBLs
FEATURE(`rhsbl',`dsn.rfc-ignorant.org',`"550 Mail from domain "
$`'&{RHS} " refused. MX of domain do not accept bounces. This violates
RFC 821/2505/2821 - see http://www.rfc-ignorant.org/"')dnl
FEATURE(`rhsbl',`bogusmx.rfc-ignorant.org',`"550 Mail from domain "
$`'&{RHS} " refused. An MX for your domain is bogus - see
http://www.rfc-ignorant.org/"')dnl
FEATURE(`rhsbl',`whois.rfc-ignorant.org',`"550 Mail from domain "
$`'&{RHS} " refused. The WHOIS information is missing, incomplete, or
incorrect - see http://www.rfc-ignorant.org/"')dnl
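
For anyone unfamiliar with what these lookups do: a DNSBL check reverses
the octets of the connecting IP, appends the list's zone, and asks for an
A record; an RHSBL check does the same with the sender's domain. Any
answer means "listed". A minimal Python sketch follows; the zone names
are placeholders under example.org, not real lists, and the function
names are mine.

```python
import ipaddress
import socket


def dnsbl_query_name(client_addr, zone):
    """Build the DNS name queried for an IPv4 address in a DNSBL zone."""
    ip = ipaddress.IPv4Address(client_addr)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(ip).split(".")))
    return f"{reversed_octets}.{zone}"


def rhsbl_query_name(sender, zone):
    """Build the DNS name queried for a sender's domain in an RHSBL zone."""
    domain = sender.rsplit("@", 1)[-1].lower().rstrip(".")
    return f"{domain}.{zone}"


def is_listed(query_name):
    """Return True if the name resolves, i.e. the list has an entry for it."""
    try:
        socket.gethostbyname(query_name)
        return True
    except OSError:
        # No such record. Note that this sketch also treats resolver
        # failures as "not listed".
        return False


if __name__ == "__main__":
    print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
    print(rhsbl_query_name("someone@example.com", "rhsbl.example.org"))
```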
I think an interesting solution to your problem might be to set up one
of your servers to run a training daemon and have the others send their
training messages to it (e.g. see http://orderamidchaos.com/bogofilter/bfproxy).
Then you can have a single central wordlist which is shared for
classification purposes (perhaps even on network storage).
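
Whichever machine ends up doing the training, it can avoid one bogofilter
exec per message by writing a whole batch to an mbox and registering it in
a single run with -M (mbox input) plus -s or -n. The Python sketch below
is only an illustration and is not how bfproxy works: the function names,
the placeholder From address, and the wordlist directory are assumptions.

```python
import mailbox
import os
import subprocess
import tempfile
from email.message import EmailMessage


def write_mbox(bodies, path):
    """Write body-only records to an mbox file with a placeholder header."""
    box = mailbox.mbox(path, create=True)
    try:
        for body in bodies:
            msg = EmailMessage()
            msg["From"] = "unknown@invalid"  # placeholder, not real data
            msg.set_content(body)
            box.add(msg)  # mbox escapes body lines that start with "From "
        box.flush()
    finally:
        box.close()
    return path


def training_command(kind, wordlist_dir):
    """Build the argument list for one bulk registration run."""
    flag = {"spam": "-s", "ham": "-n"}[kind]
    return ["bogofilter", flag, "-M", "-d", wordlist_dir]


def train(kind, bodies, wordlist_dir):
    """Register a whole batch of messages with a single bogofilter process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = write_mbox(bodies, os.path.join(tmp, kind + ".mbox"))
        with open(path, "rb") as handle:
            subprocess.run(training_command(kind, wordlist_dir),
                           stdin=handle, check=True)
```

Usage would be along the lines of train("spam", spam_bodies, "/path/to/wordlists"),
called once per nightly batch rather than once per message. Because the
same placeholder header goes on both ham and spam, it should carry little
weight, but real headers would be better.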
Tom
Randy J. Ray wrote:
> Please bear with me through this, as it might ramble a bit here and there. I'm
> fairly new to using bogofilter, and the context in which I'm managing it is
> probably a little different than most users...
>
> My company is using bogofilter as part of our fraud detection/interception
> strategy. I don't know how much detail I can go into, but I can say this:
>
> * We manage emails between buyers and sellers on a classifieds-oriented site
> * We store email messages in an RDBMS, minimal header information but we store
> complete bodies (HTML is stripped out and non-HTML MIME parts are removed)
> * Each message is evaluated with bogofilter, those with a high-enough or
> low-enough score to be unambiguously spam/ham are marked as such and handled
> * Those that are in the mid-band get reviewed by people we contract out to,
> and hand-classified as one or the other
> * Nightly, the known-ham and known-spam are used to create the word-lists,
> which are distributed across our servers
>
> Because the machines that handle the incoming mail are distributed, we don't do
> automatic classification; that is, we don't update the word-lists during
> classification because the updates wouldn't get shared. That's why we do it
> nightly via jobs that run out of cron.
>
> That's also the problem we're running into: the number of messages we have to
> train against has gotten so large that it is taking a *really* long time to
> generate the word-lists. Since the messages are in the RDBMS, and not mbox
> files, we end up doing shell-execution of bogofilter once per message to be
> trained (don't blame me, this was written before I got here!). To be plain,
> it's taking forever to generate the files. It's a mess, and I have the dubious
> honor of trying to improve the process.
>
> I've been trying to look into ways to reduce the number of times we execute the
> bogofilter binary itself. It seems that I could emulate a sort of "server" with
> the -b option, writing each message to a temp file then feeding the file name
> to STDIN. But that appears to be geared towards message classification, not
> spam/ham training. I'm looking at the possibility of creating fake mbox files
> from the messages, but I have next to no header information for the messages,
> just the bodies. I'm not sure how well that would work.
>
> If anyone has any experiences similar to this, or any thoughts or ideas to
> share, I'd be happy to hear them. I've been approaching this with an eye
> towards encapsulating some of the functionality into a Perl module. It seems
> that the classifier-(pseudo)-daemon would be easy-enough to do, but that
> wouldn't help my problem. We get good-enough performance and throughput on the
> actual classification of incoming messages. It's the creation of the word-list
> files from our (growing) corpus that is driving me nuts.
>
> Randy