Why strip headers?

Fri May 6 20:50:14 CEST 2005

----- Original Message ----- 
From: "David Relson" <relson at osagesoftware.com>
> On Fri, 6 May 2005 11:59:34 +1000
> Ben Finney wrote:
>> On 05-May-2005, David Relson wrote:
>> > Ben Finney wrote:
>> > > =====
>> > > Moreover, headers which do not directly influence the email in any
>> > > functional way, nor are visible to the end-user in a standard
>> > > graphical MUA, are highly likely to contain information which
>> > > spammers think will detract from normal statistical filtering. It
>> > > is therefore desireable to remove these elements, specifically
>> > > X-headers, prior to filtering.  Spamitarium removes all invisible,
>> > > non-functional header lines.
>> > > =====
>> > >
>> > > Is it foolishly naïve of me to think that bogofilter knows much
>> > > more about my personal mail history than some spammer, and can
>> > > judge those bogus headers as is?
>> >
>> > All bogofilter knows about your email is which ones you've told it
>> > are spam and which ones are ham.  If there are different X-Headers
>> > it the two message sets, then their presence may well help
>> > bogofilter in its spam vs ham scoring.
>>
>> Right. So for messages that are *ham*, that contain X-Foo header
>> fields set by well-behaved software or knowledgeable correspondents,
>> why would I want bogofilter not to see those and learn from them?

Indeed some non-standard headers contain useful information, and some may 
contribute to proper spam/ham identification.  However, the main reason for 
doing this is that sometimes the overwhelming majority of ham uses one or 
several particular tokens which are not often found in spam, and only the 
occasional spammer will figure out how to stick these into the headers in 
such a way that the message becomes hammy.  When there are few other 
tokens in the message on which to classify, the headers will often 
cause these spams to be classified as ham or unsure.  I got tired of 
finding 4-5 of these spams a week in my inbox, which seemingly could not be 
registered often enough to make them spammy.

A classic example of this is a group of Viagra spams I had been receiving.  
The From address was spoofed as my own address, and a couple of extra 
Received lines were thrown in containing servers from common online stores 
or ezines.  They added a "Precedence: list" field, which is hugely hammy, 
used X-Mailer, which is also hammy, and used random X-header fields to 
insert extra tokens containing my server name or IP address, and so on.  The 
body was just a single image.  The tokens used to make these spams hammy are 
not going to be made spammy or even unsure by registering this spam hundreds 
of times, because the overwhelming majority of hams reverse that score. 
Sometimes it doesn't even take a clever spammer to fool bogofilter, just 
someone using the same email provider as lots of your friends.
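
To put rough numbers on why re-registering does not help, here is a small 
sketch.  It uses a simplified per-token estimate (the token's rate in spam 
divided by its combined rate in spam and ham), which is only an approximation 
in the spirit of statistical filters and not bogofilter's exact calculation. 
The corpus sizes and counts are made up for illustration.

```python
def token_spamicity(spam_hits, ham_hits, spam_msgs, ham_msgs):
    """Simplified estimate: share of a token's relative frequency
    that comes from spam (0.0 = pure ham, 1.0 = pure spam)."""
    spam_rate = spam_hits / spam_msgs
    ham_rate = ham_hits / ham_msgs
    return spam_rate / (spam_rate + ham_rate)

# Hypothetical corpus: 5000 hams and 5000 spams.  A token such as
# "precedence:list" appears in 3000 of the hams and 10 of the spams.
before = token_spamicity(10, 3000, 5000, 5000)

# Register 200 more spams that all carry the token.
after = token_spamicity(210, 3000, 5200, 5000)

# Both values stay far below the neutral 0.5 mark.
print(f"before: {before:.3f}  after: {after:.3f}")
```

Under these assumed counts the token barely moves, because the thousands of 
hams containing it outweigh a couple of hundred registered spams.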

For this reason, I invented spamitarium to foil spammers' efforts to tilt 
messages toward ham using email headers.  First, it determines which 
Received lines are legitimate by following the from/by chain backward, 
starting from the top (the line set by your own server).  Any Received lines 
that fail this test are changed to "Received: untrusted", a token which 
becomes rather spammy.  Further, any Received lines containing local or 
invalid IPs are removed.  All IP addresses are resolved and reverse lookups 
performed on domains to double-check whether any of them are false.

It also prefixes the HELO string with "helo-" to differentiate it from other 
tokens, so helo-mail.osagesoftware.com is different from 
helo-osagesoftware.com.  The first is hammy, since that's how legitimate 
emails are sent from mail.osagesoftware.com; the second would probably be 
spammy if a spammer does a reverse DNS lookup and uses the result 
(osagesoftware.com) as a HELO string in a spoofed Received line.  This 
corrects a lot of problems from spammers using my own server in the HELO 
string, not only foiling their attempt at making it look hammy, but sealing 
the fate of the spam by making it look extremely spammy.

On top of all that, spamitarium looks up the ASN for every IP so that 
bogofilter can classify whole regions as hammy or spammy based on the 
preponderance of email that comes from that ASN.  After dealing with the 
Received lines, it removes any non-standard (not specified in any RFC) 
headers, which would include "Precedence" and "X-Mailer" and anything else 
containing spammer-defined but not user-visible info.  Most of these 
modifications can be turned on and off with command line switches as 
described in the docs (spamitarium -h).
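
As a rough illustration of the from/by chain check and the "helo-" prefix, 
here is a minimal sketch.  It is not spamitarium's actual code: the function 
name, the regular expression, and the example hosts are made up, it does no 
DNS or ASN lookups, and it assumes that the "from" name on one line must equal 
the "by" name on the next line down.

```python
import re

# Pull the "from" (HELO) name and the "by" host out of a Received value.
RECEIVED_RE = re.compile(
    r"from\s+([\w.-]+).*?\bby\s+([\w.-]+)", re.IGNORECASE | re.DOTALL
)

def check_received_chain(received_values, my_server):
    """Walk Received values from the top.  A line is trusted only if its
    'by' host is the one we expect: first our own server, then the
    'from' host of the line above.  After the first break, every
    remaining line is rewritten to 'untrusted'."""
    expected_by = my_server.lower()
    trusted = True
    result = []
    for value in received_values:
        match = RECEIVED_RE.search(value)
        if trusted and match and match.group(2).lower() == expected_by:
            helo, by_host = match.group(1), match.group(2)
            # Prefix the HELO name so it becomes its own token.
            result.append(f"from helo-{helo} by {by_host}")
            expected_by = helo.lower()
        else:
            trusted = False
            result.append("untrusted")
    return result

chain = [
    "from mail.example.com (mail.example.com [192.0.2.10]) by mx.example.org with ESMTP",
    "from desk.example.com by mail.example.com with SMTP",
    "from shop.example.net by relay.example.net with SMTP",  # does not link up
]
for value in check_received_chain(chain, "mx.example.org"):
    print("Received:", value)
```

The first two lines link up and are kept with the prefixed HELO name; the 
third does not continue the chain, so it comes out as "Received: untrusted".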

>> > Some (many?) mail delivery agents add X-Header lines to a message.
>> > If _yours_ adds one or X-Header lines, bogofilter will see them in
>> > _every_ ham and _every_ spam.  The result is tokens with scores of
>> > 0.5 which are ignored when scoring.
>>
>> And if I want bogofilter to learn from the X-Foo header fields, how
>> does stripping them help me?
>>
>> In particular, many administrators configure spamassassin to make
>> decisions about a mail and put those decisions in X-Spam or other
>> header fields, so that individual users can decide for themselves
>> about whether or not to dump the message.  This links nicely with
>> bogofilter, since it can learn about spam or ham by seeing how well my
>> decisions match spamassassin's results.

If you use spamitarium (you don't have to), it only strips the non-standard 
fields if you pass in the parameter "s".  Otherwise, it will allow all 
headers through.  If you're using spamassassin and wish to take advantage of 
spamitarium's other features, just don't use the "s" parameter.
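
For anyone wondering what the stripping amounts to, here is a small sketch 
using Python's standard email module.  It is an illustration only, not 
spamitarium's code: the allow-list is deliberately short and incomplete, and 
the strip argument is a made-up stand-in for the "s" parameter.

```python
from email import message_from_string

# Illustrative allow-list only; the real set of RFC-defined fields is longer.
STANDARD_HEADERS = {
    "from", "to", "cc", "subject", "date", "message-id", "received",
    "reply-to", "in-reply-to", "references", "mime-version",
    "content-type", "content-transfer-encoding",
}

def filter_headers(raw_message, strip=True):
    """Return the message text, dropping headers that are not on the
    allow-list when strip is True; otherwise pass everything through."""
    msg = message_from_string(raw_message)
    if strip:
        for name in set(msg.keys()):
            if name.lower() not in STANDARD_HEADERS:
                del msg[name]  # removes every occurrence of that header
    return msg.as_string()

raw = (
    "From: someone@example.com\n"
    "Subject: hi\n"
    "X-Mailer: FooMail 1.0\n"
    "Precedence: list\n"
    "\n"
    "body\n"
)
print(filter_headers(raw, strip=True))
```

With strip=False the X-Mailer and Precedence lines come through untouched, 
which is the behaviour you would want when relying on X-Spam headers.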

>> > Stripping X-Header lines, as Tom does, may or may not have an
>> > effect. It all depends on your particular mail setup.
>>
>> My main concern with spamitarium is that it assumes X-Foo header
>> fields are malicious by default. On the contrary, there is often a lot
>> of useful information in them that bogofilter can learn from.

I'm not assuming that X-Foo is malicious, but rather that it could be, and 
that it serves no practical standard purpose as defined in any RFC. 
Therefore, it doesn't hurt to remove it, and it can certainly help.  If you 
don't like this assumption, then either use spamitarium without the "s" 
parameter, or don't use it at all.  So far, this assumption has worked out 
very well on my own email.  And I think it would probably help Joe Hill with 
his problem where the same email goes up and down in spamicity due to 
certain header fields being registered in lots of his ham.

> Since Tom is distributing the source code, you can modify it as you see
> fit.  If I were in your shoes, I'd implement the change as a command
> line switch and send a patch to Tom.  If he accepted the patch, then
> you might be able to use future versions without the need to customize
> them.

As I mentioned, the command line switch is already there to turn on 
non-standard field stripping.  Using spamitarium without it does not strip 
any fields.  I'm also open to any suggestions for improvements or more 
functionality and would very much be interested in patches.  I don't want 
spamitarium to turn into spamassassin with tons of constantly changing rules 
and heuristics though; rather, it is just to verify and standardize the 
header.  With the non-standard and fake information removed, bogofilter is 
able to better distinguish between ham and spam based on the real and 
standard information provided.

Tom