[bogofilter] Re: headers
Tom Anderson
tanderso at oac-design.com
Tue Apr 6 16:05:46 CEST 2004
This bounced the first time around...
On Sat, 2004-04-03 at 04:37, Tom Anderson wrote:
> As I promised previously, I've built a program which will manipulate
> email headers favorably before filtering. It's called "spamitarium",
> since it's where spams go to get their heads fixed ;)
>
> There are a number of options I've included. The most important one
> (it gives the best results) is allowing only standard (RFC 2822, et
> al.) header lines. Passing the "s" parameter does this. All other
> (X-, etc.) lines are stripped. The second option ("r") parses the
> Received lines to determine validity by verifying the from/by chain,
> outputting "untrusted" in place of a Received header which is out of
> place. It also removes the "helo" portion unless the "e" option is
> passed in, in which case it prepends "helo-" to the helo string so
> that it is differentiated from the "rdns" or "ip" strings when
> bogofilter sees it. This is important since the helo string could
> easily be forged to represent your own server... I've seen this done
> often. Passing in the "d" option allows DNS lookups (forward and
> reverse), while "f" forces "rdns" lookups even if already provided by
> the MTA. The "a" option looks up the AS number (ASN) and includes it
> on the Received line. "w" allows the output of the body of the email
> in addition to the header, which is useful for testing on the command
> line, but not likely to be used in the MDA. And "b" provides benchmark
> info.
>
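
To make the idea behind the "s" option concrete, here is a minimal Python sketch of keeping only standard header lines. It is not spamitarium's code: the function name and the whitelist (the field names from RFC 2822 section 3.6) are assumptions made for illustration, and the real program may keep a different set.

```python
from email import message_from_string

# Assumption: header field names defined in RFC 2822 section 3.6.
# spamitarium's own list may differ (the post says "RFC 2822, et al").
STANDARD_HEADERS = {
    "date", "from", "sender", "reply-to", "to", "cc", "bcc",
    "message-id", "in-reply-to", "references", "subject",
    "comments", "keywords", "return-path", "received",
    "resent-date", "resent-from", "resent-sender", "resent-to",
    "resent-cc", "resent-bcc", "resent-message-id",
}


def strip_nonstandard_headers(raw_message: str) -> str:
    """Return the message with every header outside the whitelist removed."""
    msg = message_from_string(raw_message)
    # Iterate over a copy of the names, since we delete while looping.
    for name in set(msg.keys()):
        if name.lower() not in STANDARD_HEADERS:
            del msg[name]  # removes every occurrence of that header
    return msg.as_string()


if __name__ == "__main__":
    sample = (
        "From: someone@example.com\n"
        "X-Mailer: ExampleMailer 1.0\n"
        "Subject: hello\n"
        "\n"
        "body text\n"
    )
    print(strip_nonstandard_headers(sample))
```

The Received lines are kept here on purpose, since the "r", "e", "d", "f" and "a" options all operate on them.
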
> So far, with all options enabled, the CPU time has not exceeded 0.1
> seconds on my fairly antiquated K6 server, with average times under 0.05
> seconds. Therefore, there isn't too much overhead associated with
> running this on all emails, and it reduces the number of tokens
> bogofilter needs to process and store.
>
> Following is a rather simple test of spamitarium's effectiveness. The
> file "golden.eml" is a spam. Alone, bogofilter does a very decent job
> on it, scoring 0.741945. But, as you can see, removing the header cruft
> boosts the score to 0.930029. Plus, since no AS numbers or "helo-"
> strings have been registered in my database yet, those tokens have no
> effect so far... after several registrations, they should boost the
> scores even more.
> Below is the diff:
>
> $ cat eml/golden.eml | bogofilter -vvv >golden_o.txt
> $ ./spamitarium -sreadw < eml/golden.eml | bogofilter -vvv >golden_s.txt
> $ diff golden_o.txt golden_s.txt
> 1c1
> < X-Bogosity: Yes, tests=bogofilter, spamicity=0.741945, version=0.17.5
> ---
> > X-Bogosity: Yes, tests=bogofilter, spamicity=0.930029, version=0.17.5
> 3d2
> < "head:Precedence" 3975 0.427694 0.005717 0.013214 +
> 5d3
> < "head:list" 4697 0.456455 0.008060 0.017369 +
> 8,10d5
> < "head:UID" 45 0.003878 0.000090 0.024718 +
> < "head:From" 25131 1.826951 0.059517 0.031553 +
> < "head:Sat" 3215 0.204718 0.008387 0.039381 +
> 15,16d9
> < "head:Apr" 346 0.010341 0.001214 0.105276 +
> < "head:oac-design.com" 17371 0.501697 0.061420 0.109075 +
> 28,29d20
> < "rcvd:forged" 10552 0.154629 0.041310 0.210833 +
> < "rcvd:may" 10552 0.154629 0.041310 0.210833 +
> 34d24
> < "rcvd:Postfix" 9514 0.106318 0.038128 0.263964 +
> 37d26
> < "head:tanderso" 6739 0.070932 0.027123 0.276618 +
> 74d62
> < "rcvd:SMTP" 120756 0.643400 0.502749 0.438642 -
> 80,82c68
> < "rcvd:Fri" 36451 0.178866 0.152167 0.459674 -
> < "head:NI-!!8Ip!!`XY!!j" 0 0.000000 0.000000 0.460000 -
> < "rcvd:127.0.0.1" 0 0.000000 0.000000 0.460000 -
> ---
> > "rcvd:216.109.145.120" 0 0.000000 0.000000 0.460000 -
> 83a70,71
> > "rcvd:as30092" 0 0.000000 0.000000 0.460000 -
> > "rcvd:helo-mail.bonusempire.com" 0 0.000000 0.000000 0.460000 -
> 95d82
> < "rcvd:with" 237477 0.964938 0.996702 0.508096 -
> 108d94
> < "rcvd:ESMTP" 145729 0.379544 0.617296 0.619252 -
> 135d120
> < "rcvd:PST" 61405 0.052351 0.262973 0.833976 +
> 143,144d127
> < "head:greatinternetoffers.com" 2 0.000000 0.000009 0.950909 +
> < "head:list1" 2 0.000000 0.000009 0.950909 +
> 151,152d133
> < "rcvd:localhost.bonusempire.com" 3 0.000000 0.000013 0.966250 +
> < "rcvd:mail.bonusempire.com" 3 0.000000 0.000013 0.966250 +
> 161d141
> < "rcvd:playerbonuses.com" 9 0.000000 0.000039 0.988261 +
> 168,170d147
> < "head:X-Evolution-Source" 27 0.000000 0.000116 0.996029 +
> < "head:X-UIDL" 27 0.000000 0.000116 0.996029 +
>
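
When repeating this kind of comparison on many messages, it helps to pull the score out of the X-Bogosity line automatically. The following Python sketch does that for the two lines shown in the diff above; the function name and the regular expression are illustrative assumptions, not part of bogofilter or spamitarium.

```python
import re

# Matches the spamicity value on an X-Bogosity line such as the ones above.
SPAMICITY_RE = re.compile(
    r"^X-Bogosity:.*\bspamicity=([0-9]*\.?[0-9]+)", re.MULTILINE
)


def spamicity(bogofilter_output: str) -> float:
    """Extract the spamicity from bogofilter's verbose output."""
    match = SPAMICITY_RE.search(bogofilter_output)
    if match is None:
        raise ValueError("no X-Bogosity line with a spamicity value found")
    return float(match.group(1))


original = "X-Bogosity: Yes, tests=bogofilter, spamicity=0.741945, version=0.17.5"
filtered = "X-Bogosity: Yes, tests=bogofilter, spamicity=0.930029, version=0.17.5"

if __name__ == "__main__":
    # Difference between the filtered and the unfiltered score.
    print(f"{spamicity(filtered) - spamicity(original):+.6f}")
```
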
> Now let's try registering it normally:
>
> $ cat eml/golden.eml | bogofilter -s
> $ cat eml/golden.eml | bogofilter -vvv >golden_o.txt
> $ ./spamitarium -sreadw < eml/golden.eml | bogofilter -vvv >golden_s.txt
> $ diff golden_o.txt golden_s.txt
> 1c1
> < X-Bogosity: Yes, tests=bogofilter, spamicity=0.811345, version=0.17.5
> ---
> > X-Bogosity: Yes, tests=bogofilter, spamicity=0.948691, version=0.17.5
> ...
>
> Both scores improve slightly. Now, let's unregister again, and then
> re-register using spamitarium this time:
>
> $ cat eml/golden.eml | bogofilter -S
> $ ./spamitarium -sreadw < eml/golden.eml | bogofilter -s
> $ cat eml/golden.eml | bogofilter -vvv >golden_o.txt
> $ ./spamitarium -sreadw < eml/golden.eml | bogofilter -vvv >golden_s.txt
> $ diff golden_o.txt golden_s.txt
> 1c1
> < X-Bogosity: Yes, tests=bogofilter, spamicity=0.773467, version=0.17.5
> ---
> > X-Bogosity: Yes, tests=bogofilter, spamicity=0.971580, version=0.17.5
>
> Wow, now the spamitarium version makes an even stronger case for spam!
> This seems to be working, even after only one registration.
> More registrations of similar spams should only improve this result.
> Here's the rest of the diff to see why:
>
> 3d2
> < "head:Precedence" 3975 0.427694 0.005717 0.013214 +
> 5d3
> < "head:list" 4697 0.456455 0.008059 0.017369 +
> 8,10d5
> < "head:UID" 45 0.003878 0.000090 0.024718 +
> < "head:From" 25132 1.826951 0.059520 0.031555 +
> < "head:Sat" 3216 0.204718 0.008391 0.039400 +
> 15,16d9
> < "head:Apr" 347 0.010341 0.001218 0.105608 +
> < "head:oac-design.com" 17371 0.501697 0.061419 0.109074 +
> 27,28d19
> < "rcvd:forged" 10552 0.154629 0.041309 0.210832 +
> < "rcvd:may" 10552 0.154629 0.041309 0.210832 +
> 34d24
> < "rcvd:Postfix" 9514 0.106318 0.038128 0.263963 +
> 37d26
> < "head:tanderso" 6739 0.070932 0.027123 0.276617 +
> 74d62
> < "rcvd:SMTP" 120757 0.643400 0.502749 0.438642 -
> 80,82d67
> < "rcvd:Fri" 36451 0.178866 0.152166 0.459672 -
> < "head:NI-!!8Ip!!`XY!!j" 0 0.000000 0.000000 0.460000 -
> < "rcvd:127.0.0.1" 0 0.000000 0.000000 0.460000 -
> 94d78
> < "rcvd:with" 237478 0.964938 0.996698 0.508095 -
> 107d90
> < "rcvd:ESMTP" 145729 0.379544 0.617291 0.619250 -
> 134d116
> < "rcvd:PST" 61405 0.052351 0.262971 0.833975 +
> 135a118
> > "rcvd:216.109.145.120" 1 0.000000 0.000004 0.910000 +
> 136a120,121
> > "rcvd:as30092" 1 0.000000 0.000004 0.910000 +
> > "rcvd:helo-mail.bonusempire.com" 1 0.000000 0.000004 0.910000 +
> 142,143d126
> < "head:greatinternetoffers.com" 2 0.000000 0.000009 0.950909 +
> < "head:list1" 2 0.000000 0.000009 0.950909 +
> 145,146d127
> < "rcvd:localhost.bonusempire.com" 3 0.000000 0.000013 0.966250 +
> < "rcvd:mail.bonusempire.com" 3 0.000000 0.000013 0.966250 +
>
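
The last numeric column in these listings moves from 0.460000 for a token never seen before to 0.910000 once the token has been registered in one spam. Those figures are consistent with Robinson's smoothed token probability, f(w) = (s*x + n*p) / (s + n), with x = 0.46 and s = 0.2. Both parameter values are inferred from the numbers in this post rather than taken from the actual configuration, and the reading of the columns as token, count, good frequency, bad frequency and f(w) is likewise an assumption. A small Python sketch of the arithmetic:

```python
def robinson_fw(n: int, p: float, s: float = 0.2, x: float = 0.46) -> float:
    """Robinson's smoothed probability for one token.

    n: how many times the token has been seen in registered mail
    p: raw spam probability of the token (1.0 = seen only in spam)
    s: strength given to the prior (assumed 0.2, inferred from the post)
    x: value assumed for an unknown token (assumed 0.46, inferred likewise)
    """
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    print(round(robinson_fw(0, 0.0), 6))  # token never seen: 0.46
    print(round(robinson_fw(1, 1.0), 6))  # seen once, in spam only: 0.91
    print(round(robinson_fw(1, 0.0), 6))  # seen once, in ham only: 0.076667
```

The third case matches the 0.076667 shown for tokens seen once in ham in the ham listing further down.
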
> Ok, so one spam got spammier. I don't think I need to show them here,
> but I tried this on several spams and got similar results. But what
> about hams? I don't have any very spammy hams, but here are some
> results on a ham:
>
> $ cat eml/aelan.eml | bogofilter -vvv >aelan_o.txt
> $ ./spamitarium -sreadw < eml/aelan.eml | bogofilter -vvv >aelan_s.txt
> $ diff aelan_o.txt aelan_s.txt
> 1c1
> < X-Bogosity: No, tests=bogofilter, spamicity=0.000702, version=0.17.5
> ---
> > X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.17.5
> ...
>
> Ok, so you can't improve that score much, but let's go ahead and
> register it as ham with spamitarium, and then finish the diff:
>
> $ ./spamitarium -sreadw < eml/aelan.eml | bogofilter -n
> $ cat eml/aelan.eml | bogofilter -vvv >aelan_o.txt
> $ ./spamitarium -sreadw < eml/aelan.eml | bogofilter -vvv >aelan_s.txt
> $ diff aelan_o.txt aelan_s.txt
> 1c1
> < X-Bogosity: No, tests=bogofilter, spamicity=0.000016, version=0.17.5
> ---
> > X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.17.5
>
> Even after registering (again) as ham, the non-spamitarium version still
> shows some doubt as to the hamminess, while the spamitarium version is
> more confident still:
>
> 3d2
> < "rcvd:tanderso.public" 1031 0.164459 0.000056 0.000429 +
> 9d7
> < "head:From" 25132 1.826656 0.059520 0.031560 +
> 11d8
> < "head:X-Mailer" 13350 0.904039 0.033383 0.035618 +
> 14d10
> < "head:Sat" 3216 0.204685 0.008391 0.039406 +
> 28d23
> < "head:X-Priority" 9973 0.539257 0.028566 0.050315 +
> 34,35c29
> < "head:Jan" 5132 0.212278 0.016438 0.071884 +
> < "head:aelan.com" 1 0.000162 0.000000 0.076667 +
> ---
> > "rcvd:216.109.145.120" 2 0.000162 0.000004 0.065416 +
> 37a32,33
> > "rcvd:as22909" 1 0.000162 0.000000 0.076667 +
> > "rcvd:helo-68.32.210.63" 1 0.000162 0.000000 0.076667 +
> 46d41
> < "head:oac-design.com" 17371 0.501616 0.061419 0.109090 +
> 53d47
> < "rcvd:scriptlance.com" 7 0.000162 0.000026 0.146804 +
> 70d63
> < "head:High" 3015 0.032310 0.012119 0.272790 +
> 72d64
> < "head:tanderso" 6739 0.070921 0.027123 0.276649 +
> 104d95
> < "rcvd:SMTP" 120757 0.643296 0.502749 0.438682 -
> 108,110d98
> < "head:4.40.0.60" 0 0.000000 0.000000 0.460000 -
> < "head:F!!Pg" 0 0.000000 0.000000 0.460000 -
> < "head:Y5!!K" 0 0.000000 0.000000 0.460000 -
> 118,119d105
> < "rcvd:oac-design.com" 210452 0.932149 0.881213 0.485955 -
> < "rcvd:for" 217318 0.945234 0.910424 0.490621 -
> 124d109
> < "rcvd:with" 237478 0.964782 0.996698 0.508136 -
> 138d122
> < "head:Reply-To" 78361 0.237964 0.331024 0.581776 -
> 141d124
> < "head:webmaster" 596 0.001616 0.002523 0.609580 -
>
> As you can see, there are also far fewer tokens overall with the
> spamitarium version:
>
> $ cat eml/aelan.eml | bogofilter -vv
> X-Bogosity: No, tests=bogofilter, spamicity=0.000016, version=0.17.5
> int cnt prob spamicity histogram
> 0.00 43 0.050839 0.031275 ##########################################
> 0.10 11 0.139405 0.049915 ###########
> 0.20 20 0.255632 0.108377 ####################
> 0.30 0 0.000000 0.108377
> 0.40 0 0.000000 0.108377
> 0.50 0 0.000000 0.108377
> 0.60 0 0.000000 0.108377
> 0.70 6 0.740370 0.174724 ######
> 0.80 1 0.838579 0.187499 #
> 0.90 5 0.989706 0.342782 #####
>
> $ ./spamitarium -sreadw < eml/aelan.eml | bogofilter -vv
> X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.17.5
> int cnt prob spamicity histogram
> 0.00 39 0.053818 0.034604 #######################################
> 0.10 9 0.141952 0.052405 #########
> 0.20 18 0.253511 0.112056 ##################
> 0.30 0 0.000000 0.112056
> 0.40 0 0.000000 0.112056
> 0.50 0 0.000000 0.112056
> 0.60 0 0.000000 0.112056
> 0.70 6 0.740370 0.186578 ######
> 0.80 1 0.838579 0.200618 #
> 0.90 1 0.960444 0.223613 #
>
> In this rather short email, there are 12 fewer scoring tokens (74
> instead of 86). The tokens that are now missing contributed adversely
> to the score. So, not only do we increase our accuracy, but we decrease
> our processing time and database size as well!
>
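
The token totals can be checked by summing the cnt column of the two histograms above. This is a short Python sketch with the counts copied by hand from the post; the variable names are only illustrative.

```python
# cnt column of each `bogofilter -vv` histogram, rows 0.00 through 0.90.
original_counts = [43, 11, 20, 0, 0, 0, 0, 6, 1, 5]
spamitarium_counts = [39, 9, 18, 0, 0, 0, 0, 6, 1, 1]

original_total = sum(original_counts)
spamitarium_total = sum(spamitarium_counts)

# Per-bucket difference shows where the dropped tokens were scoring.
dropped_per_bucket = [
    before - after
    for before, after in zip(original_counts, spamitarium_counts)
]

if __name__ == "__main__":
    print(original_total, spamitarium_total, original_total - spamitarium_total)
    # prints: 86 74 12
    print(dropped_per_bucket)
    # prints: [4, 2, 2, 0, 0, 0, 0, 0, 0, 4]
```

Four of the twelve dropped tokens sat in the 0.90 bucket, which fits the observation that the missing tokens were pulling this ham's score the wrong way.
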
> I invite others to repeat these experiments on their own email. And
> please report any bugs and let me know if you find it useful. The code
> can be found here: http://www.orderamidchaos.com/bogofilter/spamitarium
>
> Tom
>