spam addrs

Tue Jun 15 04:21:35 CEST 2004

David Relson wrote:
> Greetings,
> 
> I've been looking at bogofilter's parsing code with an eye to making a
> message's IP address available for logging.  
> 
> Bogofilter's lexer already has an IPADDR pattern for identifying ip
> addresses and returns an appropriate type to the get_token() function.
> The function also knows when it's processing a Received: header
> statement.  Together those two bits of info form the basis of keeping
> the message's IP address for (optional) use in logging messages.
> 
> My first version saved the first IP address seen in a Received:
> statement.  This works fine in many cases, for example:
> 
> Received: from aol.com (machine.domain.com [192.255.1.2])
> 
> However, if the machine name is of form "1.2.3.4.domain.com", the saved
> value will be "1.2.3.4", which is wrong.
> 
> The second version is a bit more complex.  Save the last IP address of
> the first Received: statement containing an IP address.  That will give
> the correct answer for:
> 
> Received: (qmail 937 invoked from network); 2 Feb 2004 19:21:52 -0000
> Received: from natmout00.rzone.de (natmout00.rzone.de [81.169.145.163])
> 	by mail.nn7.de (8.12.10/8.12.10) with ESMTP id i12JLAWl009417
> 	for <bugreports at nn7.de>; Mon, 2 Feb 2004 20:21:10 +0100 (MET)
> 
> but not for:
> 
> Received: (qmail 937 invoked from network); 2 Feb 2004 19:21:52 -0000
> Received: from unknown (HELO localhost) (127.0.0.1)
>   by localhost with SMTP; 2 Feb 2004 19:21:52 -0000
> Received: from natmout00.rzone.de (natmout00.rzone.de [81.169.145.163])
> 	by mail.nn7.de (8.12.10/8.12.10) with ESMTP id i12JLAWl009417
> 	for <bugreports at nn7.de>; Mon, 2 Feb 2004 20:21:10 +0100 (MET)
> 
> The third version adds one more rule: ignore 127.0.0.1.
> 
> The actual work is done by a simple state machine in token.c, a global
> variable ipaddr for saving the value, and 'I' recognition in format.c.
> 
> Use of the capability is via bogofilter.cf statements like:
> 
> header_format = "%h: %c, spamicity=%p, version=%v, ipaddr=%I"
> log_header_format = "%h: %c, spamicity=%p, version=%v, ipaddr=%I"
> 
> In cases where the message is forwarded multiple times by internal mail
> servers, this method of identifying the ip address will likely identify
> one of those servers, making the saved information not useful.  When/if
> someone has more complex needs, they can ask for help in implementing
> what they actually need.
> 
> As has been mentioned, using this IP address for blacklisting should
> only be done after further checking and/or validation of the address.
> 
> Any comments?
> 
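
For reference, the selection rule described in the quoted message (take the last IP address of the first Received: header that contains one, ignoring 127.0.0.1) can be sketched in a few lines. This is only an illustration in Python, not bogofilter's token.c state machine; the name message_ip and the regular expression are assumptions made for the example.

```python
import re
from email import message_from_string

# Dotted-quad pattern (assumption for this sketch; it does not check
# that each octet is <= 255 and it ignores IPv6).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def message_ip(raw_message, skip=("127.0.0.1",)):
    """Return the last IP address of the first Received: header that
    contains a usable one, or None if no header qualifies."""
    msg = message_from_string(raw_message)
    # get_all() returns the headers top to bottom; folded continuation
    # lines are part of each value.
    for header in msg.get_all("Received", []):
        candidates = [ip for ip in IPV4.findall(header) if ip not in skip]
        if candidates:
            # "Last" rather than "first" so that a host name such as
            # 1.2.3.4.domain.com does not win over the bracketed address.
            return candidates[-1]
    return None
```

Run against the three-header example quoted above, this skips the qmail line and the 127.0.0.1 hop and returns 81.169.145.163.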

This can be pulled readily enough from the Postfix mail.info log, or
from any other MTA log.

Failing that, the Received headers will provide it consistently as well.
Similarly, I think that each MTA can provide a format for extracting the
Received headers more consistently and more reliably than you will want
to manage inside bogofilter.

You have to keep in mind that not only do you have to accommodate the IP
address representations of every MTA that exists, but you also have to
work around all the bogus kludges that exist in spam.

This is not going to be a trivial project.

It was always my assumption that you captured all the patterns that
matched an IP address and didn't look at which one was
first/second/third on a list.  Whatever the process, I did find a slight
improvement by using URLs.

Personally, I would keep bogofilter focused on filtering spam based on 
the identification of tokens to give a probability that the message is 
ham/spam.  That is why bogofilter started and why it's so effective.

Logging, IMHO, is intended to provide information on the status and
function of the running application, not necessarily to work as a
data feed for additional analysis.  The exception to this is if you have
no other alternative.  For an MTA and bogofilter this is clearly not the
case.

For a given MTA, it would be TRIVIAL to use Perl to write a script to
always extract the exact IP address you need and then to use that
information, and Perl's capabilities, to do whatever you want with that
data.  This script could be deployed either as part of the procmail
space or to go through the MTA logs.  Given Postfix's capabilities, you
can even add this in real time.
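
As a rough illustration of the log-scanning variant, here is a minimal sketch (in Python rather than Perl) that maps Postfix queue IDs to the connecting client. It assumes smtpd log lines of the usual "postfix/smtpd[PID]: QUEUEID: client=HOST[IP]" shape; the function name clients_by_queue_id is made up for the example, and other MTAs or syslog setups would need a different pattern.

```python
import re

# Assumed line shape (check against your own mail.info before relying
# on it):  ... postfix/smtpd[1234]: 3F2A11C0D: client=host.example[1.2.3.4]
CLIENT = re.compile(
    r"postfix/smtpd\[\d+\]: (\w+): client=(\S+?)\[([0-9A-Fa-f.:]+)\]"
)


def clients_by_queue_id(log_lines):
    """Map each queue ID to the (hostname, ip) pair of the client that
    handed the message to smtpd. Lines that do not match are ignored."""
    result = {}
    for line in log_lines:
        match = CLIENT.search(line)
        if match:
            queue_id, host, ip = match.groups()
            result[queue_id] = (host, ip)
    return result
```

Feeding it the lines of a log file (for example `clients_by_queue_id(open(path))`, with path pointing at your own log) gives a lookup table that a procmail-side script could join against the queue ID of the message being filtered.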

But considering the potential resource load of additional statistical
processes (ESF, Markovian, and other methods involving token vectors),
the probability that the Bayesian code might start to play with that,
and the fact that it will potentially be quite a curve to develop and
debug, I would be wary of adding anything to the system that isn't
directed towards the original goal of being fast and accurate at
filtering spam.

After all, it's still a very worthy goal and we have not yet reached 
100.000% accuracy.