Why strip headers?
Tom Anderson
tanderso at oac-design.com
Fri May 6 20:50:14 CEST 2005
----- Original Message -----
From: "David Relson" <relson at osagesoftware.com>
> On Fri, 6 May 2005 11:59:34 +1000
> Ben Finney wrote:
>> On 05-May-2005, David Relson wrote:
>> > Ben Finney wrote:
>> > > =====
>> > > Moreover, headers which do not directly influence the email in any
>> > > functional way, nor are visible to the end-user in a standard
>> > > graphical MUA, are highly likely to contain information which
>> > > spammers think will detract from normal statistical filtering. It
>> > > is therefore desirable to remove these elements, specifically
>> > > X-headers, prior to filtering. Spamitarium removes all invisible,
>> > > non-functional header lines.
>> > > =====
>> > >
>> > > Is it foolishly naïve of me to think that bogofilter knows much
>> > > more about my personal mail history than some spammer, and can
>> > > judge those bogus headers as is?
>> >
>> > All bogofilter knows about your email is which ones you've told it
>> > are spam and which ones are ham. If there are different X-Headers
>> > in the two message sets, then their presence may well help
>> > bogofilter in its spam vs ham scoring.
>>
>> Right. So for messages that are *ham*, that contain X-Foo header
>> fields set by well-behaved software or knowledgeable correspondents,
>> why would I want bogofilter not to see those and learn from them?
Indeed, some non-standard headers contain useful information, and some may
contribute to proper spam/ham identification. However, the main reason for
stripping them is that the overwhelming majority of ham sometimes uses one or
several particular tokens that are rarely found in spam, and the occasional
spammer figures out how to stick these tokens into the headers so that the
message looks hammy. When the message has few other tokens on which to
classify, the headers will often cause these spams to be classified as ham
or unsure. I got tired of receiving 4-5 of these spams a week in my inbox,
which seemingly could not be registered often enough to make them spammy.
A classic example of this is a group of Viagra spams I had been receiving.
The from address was spoofed as my own address, and a couple of extra
received lines were thrown in naming servers from common online stores or
ezines. They included a "precedence: list" field, which is hugely hammy,
used X-mailer, which is also hammy, and used random X-header fields to insert
extra tokens containing my server name or IP address, etc. The body was just
a single image. The tokens used to make these spams hammy are not going to
be made spammy, or even unsure, by registering this spam hundreds of times,
because the overwhelming majority of hams reverse that score. Sometimes it
doesn't even take a clever spammer to fool bogofilter, but just someone
using the same email provider as lots of your friends.
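To put rough numbers on why re-registering doesn't help, here is a small sketch of a Robinson-style token score of the kind bogofilter's scoring is based on. The function name, the counts, and the s and x values are all made up for illustration; they are not bogofilter's actual code or defaults.

```python
def token_score(spam_hits, ham_hits, spam_msgs, ham_msgs, s=1.0, x=0.5):
    """Robinson-style token score: near 0 is hammy, near 1 is spammy.

    s (weight of the prior) and x (score for an unseen token) are assumed
    values for this sketch, not bogofilter's configured defaults.
    """
    n = spam_hits + ham_hits
    if n == 0:
        return x  # never-seen token: fall back to the neutral prior
    spam_rate = spam_hits / spam_msgs
    ham_rate = ham_hits / ham_msgs
    p = spam_rate / (spam_rate + ham_rate)
    return (s * x + n * p) / (s + n)


# Hypothetical wordlist: "precedence: list" appears in 4000 of 5000 hams
# but in only 5 of 5000 spams.
before = token_score(5, 4000, 5000, 5000)

# Register 200 more spams carrying the token. It is still strongly hammy,
# because the ham count dwarfs the new spam registrations.
after = token_score(205, 4000, 5200, 5000)

print(f"before: {before:.4f}  after: {after:.4f}")
```

With these made-up counts the score moves toward spam but stays well under 0.1, which is the effect described above: a few hundred registrations cannot outvote thousands of hams.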
For this reason, I wrote spamitarium to foil spammers' efforts to tilt
messages toward ham using email headers. First, it determines which
received lines are legitimate by following the from/by chain backward,
starting from the top line (set by your own server). Any received lines
that fail this test are replaced with "Received: untrusted" -- a token which
becomes rather spammy. Further, any received lines containing local or
invalid IPs are removed. All IP addresses are resolved and reverse lookups
performed on domains to double-check whether any of them are false. It also
prefixes the HELO string with "helo-" to differentiate it from other tokens,
so helo-mail.osagesoftware.com is a different token from
helo-osagesoftware.com: the first is hammy, since that's how legitimate
emails are sent from mail.osagesoftware.com, and the second would probably
be spammy if a spammer does a reverse DNS lookup and uses the result
(osagesoftware.com) as a HELO string in a spoofed received line. This
corrects a lot of problems from spammers using my own server in the HELO
string, not only foiling their attempt at making the message look hammy, but
sealing the fate of the spam by making it look extremely spammy. On top of
all that, spamitarium looks up the ASN for every IP so that bogofilter can
classify whole regions as hammy or spammy based on the preponderance of
email that comes from that ASN. After dealing with the received lines, it
then removes any non-standard (not specified in any RFC) headers, which
would include "Precedence" and "X-mailer" and anything else containing
spammer-defined but not user-visible info. Most of these modifications can
be turned on and off with command line switches as described in the docs
(spamitarium -h).
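As a rough illustration of the from/by chain walk and the "helo-" prefix, here is a simplified Python sketch. It is not spamitarium's code: the function name, the regular expression, and the example hostnames are hypothetical, it compares names literally, and it skips the DNS, reverse-lookup, local-IP and ASN steps entirely. It also assumes that once the chain breaks, every line below the break is untrusted.

```python
import re

# Pull the HELO name after "from" and the receiving host after "by".
RECEIVED_RE = re.compile(
    r"from\s+(?P<helo>\S+).*?\bby\s+(?P<by>\S+)",
    re.IGNORECASE | re.DOTALL,
)


def check_received_chain(received_lines, my_server):
    """Walk Received values from the top (newest) line downward.

    A line is trusted only if its "by" host is the host we expect: our own
    server for the top line, then the "from" host of the line above it.
    Everything from the first mismatch onward becomes "untrusted".
    """
    result = []
    expected_by = my_server.lower()
    trusted = True
    for value in received_lines:
        match = RECEIVED_RE.search(value)
        if trusted and match and match.group("by").lower() == expected_by:
            helo = match.group("helo").lower()
            by = match.group("by").lower()
            # Prefix the HELO string so it becomes its own token.
            result.append(f"from helo-{helo} by {by}")
            expected_by = helo
        else:
            trusted = False
            result.append("untrusted")
    return result


lines = [
    "from mail.example.org (mail.example.org [192.0.2.10]) "
    "by mx.example.net with ESMTP; Fri, 6 May 2005 20:00:00 +0200",
    "from client.example.org ([198.51.100.7]) "
    "by mail.example.org with SMTP; Fri, 6 May 2005 19:59:58 +0200",
    # Spoofed line: its "by" host does not continue the chain above.
    "from shop.example.com by relay.example.com with SMTP; "
    "Fri, 6 May 2005 19:00:00 +0200",
]
out = check_received_chain(lines, "mx.example.net")
for line in out:
    print("Received:", line)
```

The first two lines chain correctly and keep their helo- tokens; the third does not follow from the second, so it collapses to the single "untrusted" token.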
>> > Some (many?) mail delivery agents add X-Header lines to a message.
>> > If _yours_ adds one or more X-Header lines, bogofilter will see them in
>> > _every_ ham and _every_ spam. The result is tokens with scores of
>> > 0.5 which are ignored when scoring.
>>
>> And if I want bogofilter to learn from the X-Foo header fields, how
>> does stripping them help me?
>>
>> In particular, many administrators configure spamassassin to make
>> decisions about a mail and put those decisions in X-Spam or other
>> header fields, so that individual users can decide for themselves
>> about whether or not to dump the message. This links nicely with
>> bogofilter, since it can learn about spam or ham by seeing how well my
>> decisions match spamassassin's results.
If you use spamitarium (you don't have to), it only strips the non-standard
fields if you pass in the parameter "s". Otherwise, it will allow all
headers through. If you're using spamassassin and wish to take advantage of
spamitarium's other features, just don't use the "s" parameter.
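The opt-in behaviour can be pictured with a short Python sketch. Only the "s" parameter itself comes from spamitarium; the function name, the strip_nonstandard flag standing in for "s", and the short allow-list of header names are hypothetical and far from a complete list of RFC-defined fields.

```python
from email import message_from_string

# Assumed allow-list for illustration only; a real list would cover
# every header field defined in the relevant RFCs.
STANDARD_HEADERS = {
    "received", "return-path", "from", "sender", "reply-to", "to", "cc",
    "subject", "date", "message-id", "in-reply-to", "references",
    "mime-version", "content-type", "content-transfer-encoding",
}


def filter_headers(raw_message, strip_nonstandard=False):
    """Return the message text, dropping non-standard headers only on request."""
    msg = message_from_string(raw_message)
    if strip_nonstandard:
        for name in set(msg.keys()):
            if name.lower() not in STANDARD_HEADERS:
                del msg[name]  # removes every occurrence of that header
    return msg.as_string()


raw = (
    "From: a@example.org\n"
    "To: b@example.net\n"
    "Subject: hi\n"
    "Precedence: list\n"
    "X-Mailer: Foo 1.0\n"
    "X-Spam-Status: No\n"
    "\n"
    "body\n"
)
kept = filter_headers(raw)                              # like running without "s"
stripped = filter_headers(raw, strip_nonstandard=True)  # like running with "s"
print(stripped)
```

Without the flag, X-Spam-Status and friends pass straight through to bogofilter, which is the setup you want alongside spamassassin.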
>> > Stripping X-Header lines, as Tom does, may or may not have an
>> > effect. It all depends on your particular mail setup.
>>
>> My main concern with spamitarium is that it assumes X-Foo header
>> fields are malicious by default. On the contrary, there is often a lot
>> of useful information in them that bogofilter can learn from.
I'm not assuming that X-Foo is malicious, but rather that it could be, and
that it serves no practical standard purpose as defined in any RFC.
Therefore, it doesn't hurt to remove it, and it can certainly help. If you
don't like this assumption, then either use spamitarium without the "s"
parameter, or don't use it at all. So far, this assumption has worked out
very well on my own email. And I think it would probably help Joe Hill with
his problem where the same email goes up and down in spamicity due to
certain header fields being registered in lots of his ham.
> Since Tom is distributing the source code, you can modify it as you see
> fit. If I were in your shoes, I'd implement the change as a command
> line switch and send a patch to Tom. If he accepted the patch, then
> you might be able to use future versions without the need to customize
> them.
As I mentioned, the command line switch is already there to turn on
non-standard field stripping. Using spamitarium without it does not strip
any fields. I'm also open to any suggestions for improvements or more
functionality and would very much be interested in patches. I don't want
spamitarium to turn into spamassassin with tons of constantly changing rules
and heuristics, though; rather, its job is just to verify and standardize the
header. With the non-standard and fake information removed, bogofilter is
better able to distinguish between ham and spam based on the real and
standard information that remains.
Tom