[bogofilter] Re: headers

Tom Anderson tanderso at oac-design.com
Tue Apr 6 16:05:46 CEST 2004


This bounced the first time around...

On Sat, 2004-04-03 at 04:37, Tom Anderson wrote:
> As I promised previously, I've built a program which will manipulate
> email headers favorably before filtering.  It's called "spamitarium",
> since it's where spams go to get their heads fixed ;)
> 
> There are a number of options I've included.  The most important (the
> one giving the best results) is allowing only standard (RFC 2822, et
> al.) header lines.  Passing the "s" parameter does this; all other
> (X-, etc.) lines are stripped.  The second option ("r") parses the
> Received lines and verifies the from/by chain, outputting "untrusted"
> in place of any Received header that is out of place.  It also removes
> the "helo" portion unless the "e" option is passed in, in which case it
> prepends "helo-" to the helo string so that bogofilter can tell it
> apart from the "rdns" or "ip" strings.  This is important since the
> helo string could easily be forged to represent your own server... I've
> seen this done often.  Passing in the "d" option allows DNS lookups
> (forward and reverse), while "f" forces "rdns" lookups even if already
> provided by the MTA.  The "a" option looks up the AS number (ASN) and
> includes it on the Received line.  "w" outputs the body of the email in
> addition to the header, which is useful for testing on the command
> line but unlikely to be used in the MDA.  And "b" provides benchmark
> info.
> 
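As a rough illustration of what the "s" option describes (keep only
standard header fields, drop X- and other lines), here is a minimal
Python sketch.  It is not spamitarium's code: the function name
strip_nonstandard and the STANDARD_HEADERS whitelist are assumptions
made for this example, and the post does not say exactly which fields
spamitarium keeps.

```python
from email import message_from_string

# Assumed whitelist: common RFC 2822 fields plus the MIME fields needed
# to keep the body parseable. The list spamitarium actually uses is not
# given in the post.
STANDARD_HEADERS = {
    "date", "from", "sender", "reply-to", "to", "cc", "bcc",
    "message-id", "in-reply-to", "references", "subject",
    "comments", "keywords", "received", "return-path",
    "resent-date", "resent-from", "resent-sender", "resent-to",
    "resent-cc", "resent-bcc", "resent-message-id",
    "mime-version", "content-type", "content-transfer-encoding",
}


def strip_nonstandard(raw_message: str) -> str:
    """Return the message with every non-whitelisted header removed."""
    msg = message_from_string(raw_message)
    for name in set(msg.keys()):
        if name.lower() not in STANDARD_HEADERS:
            del msg[name]  # deletes every occurrence of this field
    return msg.as_string()


if __name__ == "__main__":
    sample = (
        "From: sender@example.com\n"
        "X-Mailer: ExampleMailer 1.0\n"
        "Precedence: list\n"
        "Subject: hello\n"
        "\n"
        "body text\n"
    )
    print(strip_nonstandard(sample))
```

In this sketch the X-Mailer and Precedence lines are dropped while From
and Subject survive, which mirrors the "head:Precedence" and
"head:X-..." tokens disappearing in the diffs later in the post.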
> So far, with all options enabled, the CPU time has not exceeded 0.1
> seconds on my fairly antiquated K6 server, with average times under 0.05
> seconds.  Therefore, there isn't too much overhead associated with
> running this on all emails, and it reduces the number of tokens
> bogofilter needs to process and store.
> 
> Following is a rather simple test of spamitarium's effectiveness.  The
> file "golden.eml" is a spam.  Alone, bogofilter does a very decent job
> on it, scoring 0.741945.  But, as you can see, removing the header cruft
> boosts the score to 0.930029.  Plus, since no AS numbers or "helo-"
> strings have been registered in my database yet, those tokens have no
> effect so far... after several registrations, they should boost the
> scores even more.  Below is the diff:
> 
> $ cat eml/golden.eml | bogofilter -vvv >golden_o.txt
> $ ./spamitarium -sreadw < eml/golden.eml | bogofilter -vvv >golden_s.txt
> $ diff golden_o.txt golden_s.txt 
> 1c1
> < X-Bogosity: Yes, tests=bogofilter, spamicity=0.741945, version=0.17.5
> ---
> > X-Bogosity: Yes, tests=bogofilter, spamicity=0.930029, version=0.17.5
> 3d2
> < "head:Precedence"                 3975  0.427694  0.005717  0.013214 +
> 5d3
> < "head:list"                       4697  0.456455  0.008060  0.017369 +
> 8,10d5
> < "head:UID"                          45  0.003878  0.000090  0.024718 +
> < "head:From"                      25131  1.826951  0.059517  0.031553 +
> < "head:Sat"                        3215  0.204718  0.008387  0.039381 +
> 15,16d9
> < "head:Apr"                         346  0.010341  0.001214  0.105276 +
> < "head:oac-design.com"            17371  0.501697  0.061420  0.109075 +
> 28,29d20
> < "rcvd:forged"                    10552  0.154629  0.041310  0.210833 +
> < "rcvd:may"                       10552  0.154629  0.041310  0.210833 +
> 34d24
> < "rcvd:Postfix"                    9514  0.106318  0.038128  0.263964 +
> 37d26
> < "head:tanderso"                   6739  0.070932  0.027123  0.276618 +
> 74d62
> < "rcvd:SMTP"                      120756  0.643400  0.502749  0.438642 -
> 80,82c68
> < "rcvd:Fri"                       36451  0.178866  0.152167  0.459674 -
> < "head:NI-!!8Ip!!`XY!!j"              0  0.000000  0.000000  0.460000 -
> < "rcvd:127.0.0.1"                     0  0.000000  0.000000  0.460000 -
> ---
> > "rcvd:216.109.145.120"               0  0.000000  0.000000  0.460000 -
> 83a70,71
> > "rcvd:as30092"                       0  0.000000  0.000000  0.460000 -
> > "rcvd:helo-mail.bonusempire.com"      0  0.000000  0.000000  0.460000 -
> 95d82
> < "rcvd:with"                      237477  0.964938  0.996702  0.508096 -
> 108d94
> < "rcvd:ESMTP"                     145729  0.379544  0.617296  0.619252 -
> 135d120
> < "rcvd:PST"                       61405  0.052351  0.262973  0.833976 +
> 143,144d127
> < "head:greatinternetoffers.com"       2  0.000000  0.000009  0.950909 +
> < "head:list1"                         2  0.000000  0.000009  0.950909 +
> 151,152d133
> < "rcvd:localhost.bonusempire.com"      3  0.000000  0.000013  0.966250 +
> < "rcvd:mail.bonusempire.com"          3  0.000000  0.000013  0.966250 +
> 161d141
> < "rcvd:playerbonuses.com"             9  0.000000  0.000039  0.988261 +
> 168,170d147
> < "head:X-Evolution-Source"           27  0.000000  0.000116  0.996029 +
> < "head:X-UIDL"                       27  0.000000  0.000116  0.996029 +
> 
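For anyone repeating the comparison on many messages, the score can be
pulled out of the X-Bogosity line instead of being read off a diff.  The
following is only a sketch: the helper name spamicity and the regular
expression are assumptions for this example, based on the line format
shown in the output above.

```python
import re

# Matches the "spamicity=<number>" field as it appears in the
# X-Bogosity lines quoted in this post.
SPAMICITY_RE = re.compile(r"spamicity=([0-9.]+)")


def spamicity(x_bogosity_line: str) -> float:
    """Extract the spamicity value from one X-Bogosity header line."""
    match = SPAMICITY_RE.search(x_bogosity_line)
    if match is None:
        raise ValueError("no spamicity field found")
    return float(match.group(1))


# The two score lines from the first diff in the post.
before = "X-Bogosity: Yes, tests=bogofilter, spamicity=0.741945, version=0.17.5"
after = "X-Bogosity: Yes, tests=bogofilter, spamicity=0.930029, version=0.17.5"

delta = spamicity(after) - spamicity(before)
print(f"change in spamicity: {delta:+.6f}")
```

A positive change on spam and a non-positive change on ham is the
pattern the tests in this post are looking for.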
> Now let's try registering it normally:
> 
> $ cat eml/golden.eml | bogofilter -s                
> $ cat eml/golden.eml | bogofilter -vvv >golden_o.txt
> $ ./spamitarium -sreadw < eml/golden.eml | bogofilter -vvv >golden_s.txt
> $ diff golden_o.txt golden_s.txt 
> 1c1
> < X-Bogosity: Yes, tests=bogofilter, spamicity=0.811345, version=0.17.5
> ---
> > X-Bogosity: Yes, tests=bogofilter, spamicity=0.948691, version=0.17.5
> ...
> 
> Both scores improve slightly.  Now, let's unregister again, and then
> re-register using spamitarium this time:
> 
> $ cat eml/golden.eml | bogofilter -S
> $ ./spamitarium -sreadw < eml/golden.eml | bogofilter -s                
> $ cat eml/golden.eml | bogofilter -vvv >golden_o.txt
> $ ./spamitarium -sreadw < eml/golden.eml | bogofilter -vvv >golden_s.txt
> $ diff golden_o.txt golden_s.txt 
> 1c1
> < X-Bogosity: Yes, tests=bogofilter, spamicity=0.773467, version=0.17.5
> ---
> > X-Bogosity: Yes, tests=bogofilter, spamicity=0.971580, version=0.17.5
> 
> Wow, now the spamitarium version makes an even stronger case for spam!
> This appears to be working, even after only one registration.  More
> registrations of similar spams should improve this result further.
> Here's the rest of the diff to see why:
> 
> 3d2
> < "head:Precedence"                 3975  0.427694  0.005717  0.013214 +
> 5d3
> < "head:list"                       4697  0.456455  0.008059  0.017369 +
> 8,10d5
> < "head:UID"                          45  0.003878  0.000090  0.024718 +
> < "head:From"                      25132  1.826951  0.059520  0.031555 +
> < "head:Sat"                        3216  0.204718  0.008391  0.039400 +
> 15,16d9
> < "head:Apr"                         347  0.010341  0.001218  0.105608 +
> < "head:oac-design.com"            17371  0.501697  0.061419  0.109074 +
> 27,28d19
> < "rcvd:forged"                    10552  0.154629  0.041309  0.210832 +
> < "rcvd:may"                       10552  0.154629  0.041309  0.210832 +
> 34d24
> < "rcvd:Postfix"                    9514  0.106318  0.038128  0.263963 +
> 37d26
> < "head:tanderso"                   6739  0.070932  0.027123  0.276617 +
> 74d62
> < "rcvd:SMTP"                      120757  0.643400  0.502749  0.438642 -
> 80,82d67
> < "rcvd:Fri"                       36451  0.178866  0.152166  0.459672 -
> < "head:NI-!!8Ip!!`XY!!j"              0  0.000000  0.000000  0.460000 -
> < "rcvd:127.0.0.1"                     0  0.000000  0.000000  0.460000 -
> 94d78
> < "rcvd:with"                      237478  0.964938  0.996698  0.508095 -
> 107d90
> < "rcvd:ESMTP"                     145729  0.379544  0.617291  0.619250 -
> 134d116
> < "rcvd:PST"                       61405  0.052351  0.262971  0.833975 +
> 135a118
> > "rcvd:216.109.145.120"               1  0.000000  0.000004  0.910000 +
> 136a120,121
> > "rcvd:as30092"                       1  0.000000  0.000004  0.910000 +
> > "rcvd:helo-mail.bonusempire.com"      1  0.000000  0.000004  0.910000 +
> 142,143d126
> < "head:greatinternetoffers.com"       2  0.000000  0.000009  0.950909 +
> < "head:list1"                         2  0.000000  0.000009  0.950909 +
> 145,146d127
> < "rcvd:localhost.bonusempire.com"      3  0.000000  0.000013  0.966250 +
> < "rcvd:mail.bonusempire.com"          3  0.000000  0.000013  0.966250 +
> 
> Ok, so one spam got spammier.  I don't think I need to show them here,
> but I tried this on several spams and got similar results.  But what
> about hams?  I don't have any very spammy hams, but here are some
> results on a ham:
> 
> $ cat eml/aelan.eml | bogofilter -vvv >aelan_o.txt
> $ ./spamitarium -sreadw < eml/aelan.eml | bogofilter -vvv >aelan_s.txt
> $ diff aelan_o.txt aelan_s.txt   
> 1c1
> < X-Bogosity: No, tests=bogofilter, spamicity=0.000702, version=0.17.5
> ---
> > X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.17.5
> ...
> 
> Ok, so you can't much improve that score, but let's go ahead and
> register it as ham with spamitarium, and then finish the diff:
> 
> $ ./spamitarium -sreadw < eml/aelan.eml | bogofilter -n               
> $ cat eml/aelan.eml | bogofilter -vvv >aelan_o.txt
> $ ./spamitarium -sreadw < eml/aelan.eml | bogofilter -vvv >aelan_s.txt
> $ diff aelan_o.txt aelan_s.txt 
> 1c1
> < X-Bogosity: No, tests=bogofilter, spamicity=0.000016, version=0.17.5
> ---
> > X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.17.5
> 
> Even after registering (again) as ham, the non-spamitarium version still
> shows some doubt as to the hamminess, while the spamitarium version
> stays at 0.000000:
> 
> 3d2
> < "rcvd:tanderso.public"            1031  0.164459  0.000056  0.000429 +
> 9d7
> < "head:From"                      25132  1.826656  0.059520  0.031560 +
> 11d8
> < "head:X-Mailer"                  13350  0.904039  0.033383  0.035618 +
> 14d10
> < "head:Sat"                        3216  0.204685  0.008391  0.039406 +
> 28d23
> < "head:X-Priority"                 9973  0.539257  0.028566  0.050315 +
> 34,35c29
> < "head:Jan"                        5132  0.212278  0.016438  0.071884 +
> < "head:aelan.com"                     1  0.000162  0.000000  0.076667 +
> ---
> > "rcvd:216.109.145.120"               2  0.000162  0.000004  0.065416 +
> 37a32,33
> > "rcvd:as22909"                       1  0.000162  0.000000  0.076667 +
> > "rcvd:helo-68.32.210.63"             1  0.000162  0.000000  0.076667 +
> 46d41
> < "head:oac-design.com"            17371  0.501616  0.061419  0.109090 +
> 53d47
> < "rcvd:scriptlance.com"               7  0.000162  0.000026  0.146804 +
> 70d63
> < "head:High"                       3015  0.032310  0.012119  0.272790 +
> 72d64
> < "head:tanderso"                   6739  0.070921  0.027123  0.276649 +
> 104d95
> < "rcvd:SMTP"                      120757  0.643296  0.502749  0.438682 -
> 108,110d98
> < "head:4.40.0.60"                     0  0.000000  0.000000  0.460000 -
> < "head:F!!Pg"                         0  0.000000  0.000000  0.460000 -
> < "head:Y5!!K"                         0  0.000000  0.000000  0.460000 -
> 118,119d105
> < "rcvd:oac-design.com"            210452  0.932149  0.881213  0.485955 -
> < "rcvd:for"                       217318  0.945234  0.910424  0.490621 -
> 124d109
> < "rcvd:with"                      237478  0.964782  0.996698  0.508136 -
> 138d122
> < "head:Reply-To"                  78361  0.237964  0.331024  0.581776 -
> 141d124
> < "head:webmaster"                   596  0.001616  0.002523  0.609580 -
> 
> As you can see, there are also far fewer tokens overall with the
> spamitarium version:
> 
> $ cat eml/aelan.eml | bogofilter -vv              
> X-Bogosity: No, tests=bogofilter, spamicity=0.000016, version=0.17.5
>    int  cnt   prob  spamicity histogram
>   0.00   43 0.050839 0.031275 ##########################################
>   0.10   11 0.139405 0.049915 ###########
>   0.20   20 0.255632 0.108377 ####################
>   0.30    0 0.000000 0.108377 
>   0.40    0 0.000000 0.108377 
>   0.50    0 0.000000 0.108377 
>   0.60    0 0.000000 0.108377 
>   0.70    6 0.740370 0.174724 ######
>   0.80    1 0.838579 0.187499 #
>   0.90    5 0.989706 0.342782 #####
> 
> $ ./spamitarium -sreadw < eml/aelan.eml | bogofilter -vv              
> X-Bogosity: No, tests=bogofilter, spamicity=0.000000, version=0.17.5
>    int  cnt   prob  spamicity histogram
>   0.00   39 0.053818 0.034604 #######################################
>   0.10    9 0.141952 0.052405 #########
>   0.20   18 0.253511 0.112056 ##################
>   0.30    0 0.000000 0.112056 
>   0.40    0 0.000000 0.112056 
>   0.50    0 0.000000 0.112056 
>   0.60    0 0.000000 0.112056 
>   0.70    6 0.740370 0.186578 ######
>   0.80    1 0.838579 0.200618 #
>   0.90    1 0.960444 0.223613 #
> 
> In this rather short email, there are 12 fewer scoring tokens.  Those
> tokens that are now missing contributed adversely to the score.  So, not
> only do we increase our accuracy, but we decrease our processing time
> and database size as well!
> 
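The figure of 12 can be checked by summing the "cnt" column of the two
histograms above.  This is a small sketch for that arithmetic only; the
function name scored_tokens and the parsing pattern are assumptions for
this example, and the histogram bars are omitted from the sample data.

```python
import re

# One histogram row: interval start (e.g. 0.20), then the token count.
BUCKET_RE = re.compile(r"^\s*(\d\.\d\d)\s+(\d+)\s")


def scored_tokens(histogram: str) -> int:
    """Sum the cnt column over all interval rows of `bogofilter -vv` output."""
    total = 0
    for line in histogram.splitlines():
        match = BUCKET_RE.match(line)
        if match:
            total += int(match.group(2))
    return total


# First four columns of the two histograms quoted in the post.
PLAIN = """
   int  cnt   prob  spamicity histogram
  0.00   43 0.050839 0.031275
  0.10   11 0.139405 0.049915
  0.20   20 0.255632 0.108377
  0.30    0 0.000000 0.108377
  0.40    0 0.000000 0.108377
  0.50    0 0.000000 0.108377
  0.60    0 0.000000 0.108377
  0.70    6 0.740370 0.174724
  0.80    1 0.838579 0.187499
  0.90    5 0.989706 0.342782
"""

FILTERED = """
   int  cnt   prob  spamicity histogram
  0.00   39 0.053818 0.034604
  0.10    9 0.141952 0.052405
  0.20   18 0.253511 0.112056
  0.30    0 0.000000 0.112056
  0.40    0 0.000000 0.112056
  0.50    0 0.000000 0.112056
  0.60    0 0.000000 0.112056
  0.70    6 0.740370 0.186578
  0.80    1 0.838579 0.200618
  0.90    1 0.960444 0.223613
"""

plain_total = scored_tokens(PLAIN)        # 86
filtered_total = scored_tokens(FILTERED)  # 74
print(plain_total, filtered_total, plain_total - filtered_total)
```

The totals come to 86 and 74, a difference of 12, with four of the
removed tokens coming from the 0.90 interval.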
> I invite others to repeat these experiments on their own email.  And
> please report any bugs and let me know if you find it useful.  The code
> can be found here: http://www.orderamidchaos.com/bogofilter/spamitarium
> 
> Tom
> 

