Prediction [was: spam addrs]

Wed Jun 30 00:06:55 CEST 2004

From: "David Relson" <relson at osagesoftware.com>
> Since the two Received lines are _real_, i.e. are created by standard
> MTAs, they are worth dealing with.  I'd describe the two cases as:
>
> Received: from [string] (domain [address]) ...
> Received: from [address] (helo=string) ...
>
> It's not clear to me (as the human reader) how to distinguish them when
> string and address both look like IP addresses.  That makes it a bit
> difficult to write parsing code :-(

It is difficult.  With regular expressions it's slightly easier.

> I've looked at spamitarium's regexes and confess that, to my
> inexperienced eye, they're complex.  Give me a simple rule for
> distinguishing them and I can try to implement it.

I don't think there is a simple rule like you propose.  Due to the different
formats used by different MTAs, and the ability of spammers to forge one
or more fields, it requires a complex expression.  Brackets and parentheses
are optional in many cases, IP and rDNS and IDENT information may or may not
be present, and these elements may all be arranged in many different ways.
For instance, here are a few:

$RDNS (HELO $HELO) ($LUSER@[$IP])
$RDNS (HELO $HELO) ([$IP])
$RDNS ([$IP] helo=$HELO)
$RDNS ($LUSER@$IP)
$RDNS($IP)
[$IP] (helo=$HELO ident=$LUSER)
$IP (account $LUSER HELO $HELO)
[$IP] (helo=$HELO)
$IP:?\d*? (HELO $HELO)
$HELO (IDENT:$LUSER@$RDNS [$IP])
$HELO (<$RDNS> [$IP])
$HELO ($IP ident=$LUSER)
$HELO (proxying for $IP) (user $LUSER)
$HELO (account $LUSER [$IP] verified)
et cetera, et cetera

I've identified as many different received lines as I could find (at least
100 variations) and I condensed them into a few dozen regexes in
spamitarium.  There's still no guarantee that it represents all of them
though.  The best you can do is pick the top 5-10 or so MTAs, take their
default received line output, munge the HELO ([^\s\0\/\\\#]+?) and LUSER
((\w|-|\.)+?) strings, and maybe the rDNS too, in as many ways as possible,
and attempt to derive a ruleset based on that.  This way it would at least
remove the cases I've considered from online services, firewalls, antivirus
programs, proxies, etc.  You might be left with only a few regexes' worth.  I
still doubt that you'd make it nearly as simple as "first address after
'from'" though.
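
To give a feel for what such a ruleset looks like, here is a minimal sketch
in Python (not spamitarium's actual code) covering only three of the layouts
listed above.  The HELO and LUSER character classes are the ones quoted in
this message; the IP and RDNS patterns, the rule order, and the name
parse_received are my own assumptions for illustration.

```python
import re

# Character classes quoted in the message; everything else is illustrative.
HELO = r"[^\s\0/\\#]+?"
LUSER = r"(?:\w|-|\.)+?"
IP = r"\d{1,3}(?:\.\d{1,3}){3}"   # assumption: IPv4 only, no range check
RDNS = r"[A-Za-z0-9.-]+"          # assumption: plain hostname

# One pattern per layout; the first rule that matches wins.
RULES = [
    # $RDNS (HELO $HELO) ([$IP])  and  $RDNS (HELO $HELO) ($LUSER@[$IP])
    re.compile(rf"^from (?P<rdns>{RDNS}) \(HELO (?P<helo>{HELO})\) "
               rf"\((?:(?P<luser>{LUSER})@)?\[(?P<ip>{IP})\]\)"),
    # [$IP] (helo=$HELO)  and  [$IP] (helo=$HELO ident=$LUSER)
    re.compile(rf"^from \[(?P<ip>{IP})\] "
               rf"\(helo=(?P<helo>{HELO})(?: ident=(?P<luser>{LUSER}))?\)"),
    # $HELO ($IP ident=$LUSER)
    re.compile(rf"^from (?P<helo>{HELO}) "
               rf"\((?P<ip>{IP}) ident=(?P<luser>{LUSER})\)"),
]


def parse_received(value):
    """Return the named fields of a Received value, or None if no rule matches.

    `value` is the header body, i.e. the text after "Received: ".
    """
    for rule in RULES:
        m = rule.match(value)
        if m:
            return {k: v for k, v in m.groupdict().items() if v is not None}
    return None


if __name__ == "__main__":
    print(parse_received("from mail.example.com (HELO foo) ([192.0.2.1]) by mx"))
    print(parse_received("from 10.9.8.7 (192.0.2.7 ident=bob) by mx"))
```

Even this toy version needs a separate rule per layout, and a line that
matches none of them yields nothing at all, which is the coverage problem
described above.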

This is why I suggested simply not providing the proposed functionality to
log the IP address.  It's just too hard to determine without using complex
regexes unless you limit the MTAs to a very, very small set (1-2) and
require default setups at that.  This functionality is suited to a separate
program.
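
As a rough illustration of why the simple rule falls over, here is a small
Python sketch of "first address after 'from'".  The function name and the
two sample lines are made up for the example; the second one follows the
"$HELO ($IP ident=$LUSER)" layout with a HELO string that looks like an IP
address.

```python
import re

# Assumption: IPv4 dotted quads only, with no range checking.
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def first_address_after_from(value):
    """Naive rule: return the first IPv4-looking string after 'from'."""
    m = re.match(r"from\s+", value)
    if not m:
        return None
    found = IPV4.search(value, m.end())
    return found.group(0) if found else None


if __name__ == "__main__":
    # [$IP] (helo=$HELO): the rule happens to pick the connecting address.
    honest = "from [192.0.2.7] (helo=mail.example.com) by mx"
    # $HELO ($IP ident=$LUSER): the HELO string is sender-supplied, so the
    # rule returns whatever the sender chose to claim.
    forged = "from 10.9.8.7 (192.0.2.7 ident=bob) by mx"
    print(first_address_after_from(honest))
    print(first_address_after_from(forged))
```

On the second line the rule returns the HELO string rather than the address
the MTA recorded, and nothing in the line itself tells the simple rule that
it picked the wrong one.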

> is doable.  Something like "last address that's not followed by '='" is
> also doable.  What's needed for this feature to be useful is still not
> clear to me.

I don't know what use a very dubious IP address would be.  There's no way at
present to produce an IP address with confidence.  I think my regexes are
pretty good, but I still wouldn't use the resulting IP in any way other than
to send it through a statistical filter since the damage from getting it
wrong would be minimal that way.

> Bogofilter has been quite successful in a "here's a program for
> operating systems a,b,c that handles features d,e,f".  When someone has
> a need for operating system "g", that's been added.  Trying to
> anticipate that operating systems "h,i,j" will also be wanted, has not
> been done.  Similarly, identifying features and having them be adequate
> to the environments in which they _are_ being used has been a goal.

Bogofilter is a great statistical filter.  It makes predictions on a scale
from 0 to 1.  It doesn't make binary decisions.  I suggest we just keep it that
way.

Tom