Prediction [was: spam addrs]

David Relson relson at
Tue Jun 29 20:27:20 CEST 2004

On Tue, 29 Jun 2004 14:01:57 -0400
Tom Anderson wrote:

> From: "David Relson" <relson at>
> > I've seen many software projects where time was spent trying to
> > anticipate everything the user wanted.  I've seen others where the
> > time was spent addressing the needs.  The "needs" based projects
> > tended to be more successful than the "wants" project -- because
> > it's impossible to anticipate what is really valuable.
> I don't think this is a matter of wants and needs.  We're not talking
> about adding new functionality, we're talking about making sure that a
> proposed functionality actually does what it's supposed to do.  That
> is, output the correct IP of the sender.
> > So I'm willing to deal with what actually affects people and am not
> > willing to try to predict future spammer tricks.
> Most of the security problems in software today stem from the fact
> that developers assume that users are going to follow their intended
> path through the software.  Crackers try all of the unintended paths. 
> All it would require is proper bounds checking, input validation,
> etc., to close up most problems.  This is essentially the same issue
> here.  You are assuming that spammers will be kind and gentle with
> bogofilter, providing intended data. That's not a very good
> assumption.  Spammers are actively trying to defeat filtering
> software.
> > 'Tis nice that spamitarium can correctly process
> >
> >   Received: from helo-[] as209
> >     by
> >
> > but what MTA delivers this format (unbracketed address)?  I'm
> > interested in "out of the box" delivery formats, not "I'm going to
> > customize _my_ MTA's format so that it's different."
> Well, I know that Squirrelmail does, and maybe others.  But that
> wasn't my point here... what I'm saying is that any spammer can open
> an SMTP conversation with "HELO []", and MTAs (sendmail
> at least) will accept that as a valid HELO string.  The resulting
> sendmail Received line, "out of the box," will be as follows:
> Received: from [] ( []) ...
> Now, if bogofilter looks for the first square-bracketed IP address,
> it's going to return  This of course was forged by the
> spammer, and the bogofilter user will end up blocking email from
> instead of  Let's say you're using Exim
> instead... now the received line might look like:
> Received: from [] (helo=[]) ...
> I don't use Exim, so I don't know if it will accept brackets in the
> HELO string or not, but as you can see, the IP is now at the end of
> the "from" portion instead of the front.  Looking at just the front or
> just the end, just bracketed IPs, etc., won't work unless you know the
> format of the MTA's received line that is being used.
> My point regarding spamitarium was that the regexes I used to
> determine what string is the HELO and which is the IP were successful
> even with a bracketed HELO string (which I wasn't even confident they
> would be, but in testing they were).  I don't see any other way of
> being even slightly confident in the IP being returned unless you are
> doing something similar.
> Why would you want to release new functionality with known
> vulnerabilities, and have to patch it later when spammers start taking
> advantage of them, rather than address the issue now?

Hi Tom,

Since the two Received lines are _real_, i.e. are created by standard
MTAs, they are worth dealing with.  I'd describe the two cases as:

Received: from [string] (domain [address]) ...
Received: from [address] (helo=string) ...

It's not clear to me (as the human reader) how to distinguish them when
string and address both look like IP addresses.  That makes it a bit
difficult to write parsing code :-(
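
To make the two shapes concrete, here is a minimal sketch in Python that
tells them apart by position and by the "helo=" marker instead of by what
the tokens look like.  Everything in it is an assumption made for
illustration: the patterns, the classify() name, and the 192.0.2.x /
198.51.100.x documentation addresses standing in for real ones.  Real MTAs
emit more variants than these two, so this is not a complete parser.

```python
import re

# Anything shaped like a dotted quad; octet ranges are not validated.
IP = r"\d{1,3}(?:\.\d{1,3}){3}"

# Hypothetical pattern for: from [string] (domain [address])
STRING_FIRST = re.compile(
    rf"^Received: from (?P<helo>\S+) \((?P<domain>\S+) \[(?P<addr>{IP})\]\)"
)
# Hypothetical pattern for: from [address] (helo=string)
ADDRESS_FIRST = re.compile(
    rf"^Received: from \[(?P<addr>{IP})\] \(helo=(?P<helo>[^)\s]+)\)"
)

def classify(line):
    """Return (shape, connecting_address, helo_string), or None if unrecognised."""
    m = ADDRESS_FIRST.match(line)
    if m:
        return ("address-first", m.group("addr"), m.group("helo"))
    m = STRING_FIRST.match(line)
    if m:
        return ("string-first", m.group("addr"), m.group("helo"))
    return None

# Made-up sample lines; 192.0.2.1 plays the forged HELO in both.
print(classify("Received: from [192.0.2.1] (example.net [198.51.100.7]) by mx.example.org"))
print(classify("Received: from [198.51.100.7] (helo=[192.0.2.1]) by mx.example.org"))
```

Both sample lines yield 198.51.100.7 as the connecting address, but only
because each pattern hard-codes where its MTA puts that address.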

I've looked at spamitarium's regexes and confess that, to my
inexperienced eye, they're complex.  Give me a simple rule for
distinguishing them and I can try to implement it.

Coding algorithms like:

   first address of first Received: line
   last address of first Received: line
   but not


   first address after "from"

is doable.  Something like "last address that's not followed by '='" is
also doable.  What's needed for this feature to be useful is still not
clear to me.
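
As a rough sketch of how small those rules are, here they are in Python.
The function names and sample lines are hypothetical, and the addresses
are documentation placeholders.  One assumption to flag loudly: the last
rule is read here as "skip an address that comes right after '=' (with an
optional opening bracket)", which is an interpretation of the wording
above, not a settled definition.

```python
import re

# Anything shaped like a dotted quad; octet ranges are not validated.
ADDR = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

def first_address(line):
    """First address-shaped token on the line, or None."""
    found = ADDR.findall(line)
    return found[0] if found else None

def last_address(line):
    """Last address-shaped token on the line, or None."""
    found = ADDR.findall(line)
    return found[-1] if found else None

def first_address_after_from(line):
    """First address-shaped token appearing after the word 'from'."""
    _, sep, rest = line.partition("from ")
    if not sep:
        return None
    m = ADDR.search(rest)
    return m.group(0) if m else None

def last_address_not_after_equals(line):
    """Last address not immediately preceded by '=' or '=[' (assumed reading)."""
    kept = None
    for m in ADDR.finditer(line):
        prefix = line[:m.start()].rstrip("[")
        if not prefix.endswith("="):
            kept = m.group(0)
    return kept

# Made-up sample lines; 192.0.2.1 plays the forged HELO in both.
string_first = "Received: from [192.0.2.1] (example.net [198.51.100.7]) by mx.example.org"
address_first = "Received: from [198.51.100.7] (helo=[192.0.2.1]) by mx.example.org"

for rule in (first_address, last_address,
             first_address_after_from, last_address_not_after_equals):
    print(rule.__name__, rule(string_first), rule(address_first))
```

On these two samples the first-address rules return the forged 192.0.2.1
for the first line, last_address returns it for the second, and only the
'=' rule returns 198.51.100.7 for both.  That says nothing about formats
other than these two.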

Bogofilter has been quite successful with a "here's a program for
operating systems a,b,c that handles features d,e,f" approach.  When
someone has needed operating system "g", it's been added.  Nobody has
tried to anticipate that operating systems "h,i,j" will also be wanted.
Similarly, the goal for features has been to identify them and make them
adequate to the environments in which they _are_ being used.

