info about spam messages

Mon Jun 14 16:07:57 CEST 2004

From: "David Relson" <relson at osagesoftware.com>
> In any case, I was thinking of the IP address as being an optional
> formatting character.  My guess is that, most people won't care that it
> exists and usage won't be widespread.  That being said, address parsing
> would only need to be "good enough" for those who want it.

Figuring that out isn't really bogofilter's purpose though.  Curtailing
feature creep is always a good idea.  Small, fast code that does one thing
well.  Most people won't care about adding a bunch of parsing code, until
a year or two down the road the executable is bigger than MS Word.

> You've done a lot of work in this area and that work suggests an
> alternate approach to me.   Spamitarium already has much (all?) of the
> wanted ability for finding the IP address.  I'm sure a minor tweak would
> enable it to output the IP address. Right, Tom?  Assuming this is so,
> here's my idea:

Not necessarily.  Spamitarium will replace known invalid received lines with
"untrusted".  It knows they're not valid if the "from" and the "by" don't
match up from one line to the next.  It will also determine which address in
a line is the actual IP as inserted by the receiving MTA, rather than
something inserted by the sender.  This part is less certain... I've tested
it on lots and lots of received strings without error, but that doesn't mean
tomorrow someone isn't going to find something different out there.  The
regexes will need to be updated if any MTAs wildly change their received
lines, or if there is a less-known MTA out there that isn't covered yet.
There's some flexibility built-in, but it'd be a real problem if two exact
same received strings meant two different things to different MTAs.  Also,
spamitarium cannot discern which address is ultimately the spammer.  The
best it can do is remove the known-invalid stuff, ignore local addresses,
perform look-ups on everything else, and insert an ASN to boot.  In many
emails, there will still be multiple addresses in the chain.  Which one is
the one to output in the log?  All of them?
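
To give a feel for the parsing involved, here's a rough Python sketch.  It
is not spamitarium's actual code; the pattern and the parse_received name
are made up for illustration, and it only covers the one
"from HELO (rDNS [IP]) by HOST" layout.  Real MTAs vary a lot more than
this.

```python
import re

# Hypothetical pattern for a single layout: from HELO (RDNS [IP]) by HOST ...
# Other MTAs format their Received lines differently and would need more patterns.
RECEIVED_RE = re.compile(
    r"from\s+(?P<helo>\S+)\s+"
    r"\((?P<rdns>\S+)\s+\[(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\]\)\s+"
    r"by\s+(?P<by>\S+)"
)

def parse_received(value):
    """Return the helo, rdns, ip and by fields, or None if the layout is unknown."""
    flat = " ".join(value.split())  # unfold continuation lines into one line
    match = RECEIVED_RE.search(flat)
    return match.groupdict() if match else None

sample = """from mail.osagesoftware.com (osagesoftware.com [216.144.204.42])
 by oac-design.com with esmtp (Exim 4.34)
 id 1BZpAT-0003cg-J2"""

hop = parse_received(sample)
print(hop)
```

Anything that doesn't match comes back as None, which is the safe default:
an unrecognized line is better left alone than guessed at.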

For instance, let's say that a received line looks like this:

Received: from mail.osagesoftware.com (osagesoftware.com [216.144.204.42])
 by oac-design.com with esmtp (Exim 4.34)
 id 1BZpAT-0003cg-J2
 for tanderson at oac-design.com; Mon, 14 Jun 2004 08:35:32 -0400

Spamitarium can do this:

Received: from helo-mail.osagesoftware.com osagesoftware.com 216.144.204.42 as19326
   by oac-design.com 216.109.145.120
   for <tanderso at oac-design.com>; Mon, 14 Jun 2004 08:35:32 -0400

You'll see that "mail.osagesoftware.com" has been prepended with "helo-" to
identify this string as a HELO string and not an rDNS.  The rDNS has been
identified as "osagesoftware.com" from the IP "216.144.204.42".  An ASN has
been looked up and added.  The receiving MTA's IP has also been looked up
in order to facilitate comparison to any lines above it.  The
unnecessary cruft like the protocol, server type, and message id are
removed.  If the following line appeared below the one above, it would be
considered valid:

Received: from satan.hell.com (imaspammer.com [6.6.6.6])
 by osagesoftware.com with smtp (Sendmail 1.2.3)
 id blah
 for tanderson at oac-design.com; Mon, 14 Jun 2004 08:35:32 -0400

It's valid because it appears as though "osagesoftware.com" (which would
resolve to 216.144.204.42 when spamitarium looked it up) received the email
from "imaspammer.com" and then relayed it to "oac-design.com".  There's
really no way to determine whether this line is invalid other than via the
from/by chain, or if the IP is local or reserved.  If you're a clever
spammer, you may decide to insert a line like this below your actual
server's received line.  If you were really clever, you might add this line:

Received: from oac-design.com (oac-design.com [216.109.145.120])
 by imaspammer.com with emp (Evilmail 6.6.6)
 id blahblah
 for tanderson at oac-design.com; Mon, 14 Jun 2004 08:35:32 -0400

Now it looks like my own server relayed a message through your server to get
back to me.  If I put this IP in my log as a spammer, I might start blocking
email from myself.  The clever spammer says, "Hey, if I can't get through,
neither can you!  Ha, ha, ha...."  So, as you can see, spamitarium could
output all of the presumably valid lines, and even check to be sure the IP,
rDNS, etc., match up.  However, that doesn't mean that the spammer hasn't
played a trick on you.  Sure, you'll record the spammer's address and any
intermediate open relays, but you'll record innocent addresses as well.
And if you can't distinguish the spam addresses from the forged ones, then
none of them are usable.
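
Here's a small Python sketch of the from/by chain check, just to show why
the forged line slips through.  The function names and the stand-in DNS
table are assumptions for illustration only (a real check would do live
lookups), and this is not how spamitarium itself is written.

```python
import ipaddress

def is_routable(ip):
    """True only for globally routable addresses (not private, loopback, etc.)."""
    return ipaddress.ip_address(ip).is_global

def chain_links(hops, resolve):
    """Hops are listed newest first.  For each adjacent pair, report whether
    the older hop's 'by' host resolves to the 'from' IP of the hop above it."""
    return [resolve(older["by"]) == newer["ip"]
            for newer, older in zip(hops, hops[1:])]

# Stand-in DNS table (hypothetical); real code would query DNS.
FAKE_DNS = {
    "osagesoftware.com": "216.144.204.42",
    "imaspammer.com": "6.6.6.6",
    "oac-design.com": "216.109.145.120",
}

hops = [
    {"ip": "216.144.204.42", "by": "oac-design.com"},    # added by my own MTA
    {"ip": "6.6.6.6", "by": "osagesoftware.com"},        # may be real or forged
    {"ip": "216.109.145.120", "by": "imaspammer.com"},   # forged, names my own server
]

links = chain_links(hops, FAKE_DNS.get)
candidates = [h["ip"] for h in hops if is_routable(h["ip"])]
print(links)
print(candidates)
```

Every link checks out and none of the addresses are local or reserved, so
my own server's IP ends up in the candidate list right alongside the
spammer's.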

This is OK if we're using spamitarium with bogofilter because
we're not judging the email solely on these received lines.  It just removes
excess noise that bogofilter would have scored the email on anyway.  So it
improves the overall bogofilter accuracy even if the spammer inserted lots
of bogus lines.  The emails will be scored according to the content of the
message in addition to the headers.

I wouldn't try to use just the headers alone, although this may be the case
anyway with blank bodies or just single-image bodies.

> I've noticed that postfix logs a "connect from example.com[1.2.3.4]"
> message.  Further validation of the address in (3) can be done by
> comparing to the system log.  With that check, you'll know if you've got
> the proper address for the machine sending the unwanted message.

Postfix can only tell you for certain the immediate sender of the email, not
whether there were any senders before or what they were.  The immediate
sender could be an open relay or even a server at your own ISP.  Other
times, spammers will send it through a server at their ISP, but the end of
the chain is an IP address on their own machine.  E.g., they'll send it from
"1.2.3.4-spamtown-adsl.verizon.com [1.2.3.4]" through "mail.verizon.com".
If you just log the ISP line, then you miss any opportunity to nail them
down to the IP they were assigned.  Or maybe they actually sent it from the
ISP account, and the local IP is a bogus line.  You can't know for certain.
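
For what it's worth, the log comparison itself is easy to sketch in Python.
The only thing taken from the thread is the "connect from host[ip]" text;
the rest of the sample log line, and both function names, are assumptions
for illustration.  All it can confirm is the topmost hop.

```python
import re

# \b keeps "disconnect from ..." lines from matching.
CONNECT_RE = re.compile(r"\bconnect from (?P<host>\S+)\[(?P<ip>[0-9.]+)\]")

def connecting_ip(log_line):
    """Return the IP from a 'connect from host[ip]' log line, or None."""
    match = CONNECT_RE.search(log_line)
    return match.group("ip") if match else None

def immediate_sender_confirmed(log_line, top_hop_ip):
    """True if the logged connection matches the newest Received hop's IP."""
    return connecting_ip(log_line) == top_hop_ip

# Illustrative log line; the prefix before "connect from" is made up.
line = "mail postfix/smtpd[1234]: connect from example.com[1.2.3.4]"
print(connecting_ip(line))
print(immediate_sender_confirmed(line, "1.2.3.4"))
```

A match says nothing about the hops below the top one, which is exactly
where the forged lines live.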

Tom