[bogofilter] spamitarium [was: using block_on_subnets]

Tom Anderson tanderso at oac-design.com
Fri Apr 30 15:49:32 CEST 2004


> How much effect does spamitarium have on system load - cpu usage, delay
> in delivering email, etc?

Virtually none.  You can pass it the "b" flag on the command line to time it
on your own system.  Here are some results from my K6/256M server, using an
email you previously posted to this list:

First without any processing of the header, just outputting the input:

$ ./spamitarium b < eml/5599
Return-Path: <turbotax at newsletter.turbotax.com>
Received: from mta1.primary.ddc.dartmail.net (mta1.primary.ddc.dartmail.net
[146.82.220.37]) by mail.osagesoftware.com (Postfix) with ESMTP id
2AF8E67A91 for <relson at osagesoftware.com>; Mon, 20 Oct 2003 20:16:35 -0400
(EDT)
Date: Mon, 20 Oct 2003 20:16:34 -0400 (EDT)
From: TurboTax <turbotax at newsletter.turbotax.com>
To: relson at osagesoftware.com
Message-Id: <Kilauea72347-9514-77179069-3 at flonetwork.com>
Subject: TurboTax Newsletter
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----00000000000000000000000000000000000000000000000000000000000000
0"
X-Original-To: relson at osagesoftware.com
X-MID: <Kilauea72347-9514-77179069-3 at flonetwork.com>
Delivered-To: relson at osagesoftware.com

Total running time was 0 wallclock secs; 0.01 usr + 0 sys = 0.01 CPU secs.

Then, with all of the options (parsing the received strings, checking for
validity, doing rDNS and ASN lookups, stripping non-standard fields):

$ ./spamitarium sreadb < eml/5599
Return-Path: <turbotax at newsletter.turbotax.com>
Received: from helo-mta1.primary.ddc.dartmail.net
mta1.primary.ddc.dartmail.net 146.82.220.37 as6432
          by mail.osagesoftware.com 216.144.204.42
          for <relson at osagesoftware.com>; Mon, 20 Oct 2003 20:16:35 -0400
(EDT)
Date: Mon, 20 Oct 2003 20:16:34 -0400 (EDT)
From: TurboTax <turbotax at newsletter.turbotax.com>
To: relson at osagesoftware.com
Message-Id: <Kilauea72347-9514-77179069-3 at flonetwork.com>
Subject: TurboTax Newsletter
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----00000000000000000000000000000000000000000000000000000000000000
0"

Total running time was 0 wallclock secs; 0.03 usr + 0.02 sys = 0.05 CPU
secs.

As you can see, the effect on system performance is negligible unless your
server is already overburdened.  BTW, in my wordlist, as6432 was seen twice
in hams and never in spams, therefore contributing to a hammy score for me.
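
To put a number on "negligible", the two timing lines above can be compared
directly.  The following is only a small sketch: the function names are made
up for illustration, and it assumes the "= N CPU" summary format shown in the
output above.

```python
import re

# Matches the "= 0.05 CPU" part of the summary line; the line may wrap
# before "secs.", so only the number and the word CPU are required.
CPU_RE = re.compile(r"=\s*([\d.]+)\s+CPU")

def cpu_seconds(timing_line):
    """Return the total CPU seconds reported in one summary line."""
    match = CPU_RE.search(timing_line)
    if match is None:
        raise ValueError("no CPU total found in: %r" % timing_line)
    return float(match.group(1))

def overhead(baseline_line, processed_line):
    """Extra CPU seconds spent when header processing is switched on."""
    return cpu_seconds(processed_line) - cpu_seconds(baseline_line)

# The two summary lines from the TurboTax example above.
baseline = "Total running time was 0 wallclock secs; 0.01 usr + 0 sys = 0.01 CPU secs."
processed = "Total running time was 0 wallclock secs; 0.03 usr + 0.02 sys = 0.05 CPU\nsecs."

print("extra CPU per message: %.2f s" % overhead(baseline, processed))
```

For this message the full processing costs about 0.04 CPU seconds more than
simply copying the header through.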

Here's one of mine with many more received lines, first without processing:

$ ./spamitarium b < eml/citibank.eml
Return-Path: <support at citibank.com>
Received: from 213.210.179.114.adsl.nextra.cz
(213.210.179.114.adsl.nextra.cz [213.210.179.114]) by oac-design.com
(8.9.3/8.9.3) with SMTP id UAA10198 for <tanderso at oac-design.com>; Sat, 27
Mar 2004 20:58:44 -0500
Received: from hither-dns.faustcarcinoma.com ([112.40.200.204]) by
vd9-m39.hotmail.com with Microsoft SMTPSVC(5.0.2195.6824); Sun, 28 Mar 2004
07:00:03 +0500
Received: from mail.territorialmoan.com ([33.232.0.244]) by
derrick-dns.dignitaryconjure.com (8.15.5/6.94.1) with ESMTP id
b3EJbgz4789732 for <tanderso at oac-design.com>; Sun, 28 Mar 2004 03:06:03
+0100 (EST) (envelope-from support at citibank.com)
Received: from [16.112.224.20] (helo=winch9.semaphore72aerospace.com) by
mail.fluvialswig.com with esmtp (Exim 8.76) id 6YmtkB-9681gX-OG for
tanderso at oac-design.com; Sun, 28 Mar 2004 05:05:03 +0300
Received: from curiosity4.hop98chalk.com (localhost.cavern54convent.com
[127.0.0.1]) by nh9.hopkins72irvin.com (6.02.8s2/6.72.0) with ESMTP id
f0DPzVW0887634 for <tanderso at oac-design.com>; Sun, 28 Mar 2004
00:05:03 -0200 (CST) (envelope-from support at citibank.com)
Received: (from goodwin at localhost) by modem4.cranston45administrate.com
(3.04.2d3/3.85.0/directory) id d8NWfQA3863167; Sun, 28 Mar 2004 07:02:03
+0500 (CST)
Date: Sun, 28 Mar 2004 08:07:03 +0600 (CST)
From: "support at citibank.com" <support at citibank.com>
To: tanderso at oac-design.com
Message-Id: <761129479596.a4TZlEN64465582 at congruent5.bladder44donaldson.com>
Subject: Verify your E-mail with Citibank
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="--8529424399724186226"
X-Message-Info: PGSErHQ86iGxylqSFCLWoaRrgAIWag89
X-Evolution-Source: pop://tanderso@oac-design.com
X-UIDL: IM["!-2f!!W!J"!QXd"!
Status: U
X-Bogosity: No, tests=bogofilter, spamicity=0.050977, version=0.16.0

Total running time was 0 wallclock secs; 0.02 usr + 0 sys = 0.02 CPU secs.

Now, with all of the processing:

$ ./spamitarium sreadb < eml/citibank.eml
Return-Path: <support at citibank.com>
Received: from helo-213.210.179.114.adsl.nextra.cz
213.210.179.114.adsl.nextra.cz 213.210.179.114 as6721
          by oac-design.com 216.109.145.120
          for <tanderso at oac-design.com>; Sat, 27 Mar 2004 20:58:44 -0500
Received: untrusted
Received: untrusted
Received: untrusted
Date: Sun, 28 Mar 2004 08:07:03 +0600 (CST)
From: "support at citibank.com" <support at citibank.com>
To: tanderso at oac-design.com
Message-Id: <761129479596.a4TZlEN64465582 at congruent5.bladder44donaldson.com>
Subject: Verify your E-mail with Citibank
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="--8529424399724186226"

Total running time was 0 wallclock secs; 0.16 usr + 0.06 sys = 0.22 CPU
secs.

It took a little more time, but not much, and look at how much less
bogofilter has to process as a result!  After the first received line, none
of the others followed the from/by chain.  My server (oac-design.com)
received the mail from 213.210.179.114, which identified itself correctly as
a DSL user in the Czech Republic.  The next line claims the message was
received by Hotmail, which is not a DSL user in the Czech Republic, so we do
not want to use that information to classify this message.  Instead, our
token will be "rcvd:untrusted" which, in my wordlist, was seen 1383 times in
spams and 24 times in hams, with a Fisher value of 0.621936.
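
The from/by chain idea can be sketched roughly as follows.  This is not
spamitarium's actual code: the regular expressions and the names parse_hop
and mark_chain are assumptions made for illustration.  The first (newest)
line is trusted because the receiving server wrote it; each older line is
trusted only if its "by" host is one of the names or IPs the previous line
recorded after "from".

```python
import re

# Hypothetical patterns: the text between "from" and "by", and the "by" host.
HOP_RE = re.compile(r"\bfrom\s+(.*?)\s+by\s+(\S+)", re.S)
NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9.-]*")

def parse_hop(received):
    """Return (set of sender names/IPs, by-host), or None if unparseable."""
    m = HOP_RE.search(received)
    if m is None:
        return None
    sender_part, by_host = m.groups()
    senders = {name.lower() for name in NAME_RE.findall(sender_part)}
    return senders, by_host.rstrip(";").lower()

def mark_chain(received_lines):
    """Label each Received line, newest first, as trusted or untrusted."""
    labels = []
    trusted = True
    expected = None  # names the previous hop says it received from
    for line in received_lines:
        hop = parse_hop(line)
        if trusted and hop is not None and (expected is None or hop[1] in expected):
            labels.append("trusted")
            expected = hop[0]
        else:
            # Once the chain breaks, everything older is untrusted too.
            trusted = False
            labels.append("untrusted")
    return labels

# First two Received lines of the Citibank message above (trimmed).
citibank = [
    "from 213.210.179.114.adsl.nextra.cz (213.210.179.114.adsl.nextra.cz "
    "[213.210.179.114]) by oac-design.com (8.9.3/8.9.3) with SMTP id UAA10198",
    "from hither-dns.faustcarcinoma.com ([112.40.200.204]) by "
    "vd9-m39.hotmail.com with Microsoft SMTPSVC(5.0.2195.6824)",
]
print(mark_chain(citibank))
```

Note that this sketch also accepts a match on the HELO name, which the
sender controls; a stricter check would compare only the connecting IP.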

And take a look at this one... the spammer is claiming to be my own server!

$ ./spamitarium b < eml/forged.eml
Return-Path: <>
Received: from 216.109.145.120 ([203.81.198.30]) by oac-design.com
(8.9.3/8.9.3) with SMTP id SAA30997 for <tanderso at oac-design.com>; Thu, 25
Mar 2004 18:55:43 -0500
Received: from 156.76.249.101 by 203.81.198.30; Fri, 26 Mar 2004 02:55:54
+0300
Date: Fri, 26 Mar 2004 04:52:54 +0500
From: "dan" <MAILER-DAEMON at oac-design.com>
To: tanderso at oac-design.com
Message-ID: <BTFWZRTLYJCQBRKUDSKTO at czyokqofdfbbhwmmvlro@%RND_FRM_DOMAIN>
Subject: Where does your Website rank
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="--80575069425024471"
X-MSMail-Priority: Normal
X-Evolution-Source: pop://tanderso@oac-design.com
X-UIDL: $kg!!j\c"!T2@"!~"d"!
X-Priority: 3
Reply-To: "dan" <MAILER-DAEMON at oac-design.com>
X-Mailer:
X-Bogosity: No, tests=bogofilter, spamicity=0.001121, version=0.16.0

Total running time was 0 wallclock secs; 0.01 usr + 0 sys = 0.01 CPU secs.

But I'll set him straight:

$ ./spamitarium sreadb < eml/forged.eml
Return-Path: <>
Received: from helo-216.109.145.120 203.81.198.30 as17557
          by oac-design.com 216.109.145.120
          for <tanderso at oac-design.com>; Thu, 25 Mar 2004 18:55:43 -0500
Received: from 156.76.249.101 as6341
          by 203.81.198.30; Fri, 26 Mar 2004 02:55:54 +0300
Date: Fri, 26 Mar 2004 04:52:54 +0500
From: "dan" <MAILER-DAEMON at oac-design.com>
To: tanderso at oac-design.com
Message-ID: <BTFWZRTLYJCQBRKUDSKTO at czyokqofdfbbhwmmvlro@%RND_FRM_DOMAIN>
Subject: Where does your Website rank
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="--80575069425024471"

Total running time was 0 wallclock secs; 0.07 usr + 0.06 sys = 0.13 CPU
secs.

My server never says "helo" as "216.109.145.120", so the token
"rcvd:helo-216.109.145.120" has a Fisher value of 0.999984, seen in 526
spams and 0 hams.  Meanwhile, "rcvd:216.109.145.120" is seen equally in hams
and spams since I display the IP of the receiving server in every received
line.  Plus "rcvd:as17557" was seen in 4 spams and 0 hams, giving a value of
0.997873.
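
For anyone wondering where values like these come from, here is a sketch of
the per-token score f(w) from Gary Robinson's method, which I am assuming is
what the "Fisher value" above refers to.  The parameters s = 0.0178 and
x = 0.52 are also assumptions; with them the formula reproduces both figures
quoted above.  The message totals passed in are placeholders, since they do
not affect the result when a token has never been seen in ham.

```python
def robinson_f(spam_count, ham_count, spam_msgs, ham_msgs, s=0.0178, x=0.52):
    """Per-token score f(w) = (s*x + n*p) / (s + n).

    s and x are assumed values (strength and prior for unseen tokens);
    spam_msgs and ham_msgs are the number of messages in each corpus.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # token never seen: fall back to the prior
    spam_freq = spam_count / spam_msgs
    ham_freq = ham_count / ham_msgs
    p = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * p) / (s + n)

# Placeholder corpus sizes; irrelevant here because ham_count is 0.
print("%.6f" % robinson_f(526, 0, 1000, 1000))  # rcvd:helo-216.109.145.120
print("%.6f" % robinson_f(4, 0, 1000, 1000))    # rcvd:as17557
```

This prints 0.999984 and 0.997873.  The value for "rcvd:untrusted" cannot be
checked the same way, because it depends on the total number of spam and ham
messages in the wordlist, which are not given here.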

Given these vast improvements in classification, I believe the benefit of
running "spamitarium" on every email far outweighs the minuscule drain on
system resources.  Please let me know your results if you try it on your own
system.

Tom


