Much simplified lexer

Matthias Andree matthias.andree at gmx.de
Fri Nov 14 01:30:32 CET 2003


On Fri, 14 Nov 2003, michael at optusnet.com.au wrote:

> > > assumed that version:
> > > <INITIAL>E?SMTP{WHITESPACE}+{WHITESPACE}id{ID}
> > 
> > We may just want to drop it altogether. If we want to drop "constant"
> > parts, say, "constant" Received: or Delivered-To: lines, it'd be better
> > to strip off the first N Received: lines.
> 
> Terrible idea. The Received lines are a very rich source for significant
> tokens for me. :)

I don't see it as a bad idea. The goal of these rules is to
reduce the count of unique tokens in the database that aren't
indicative of spam, in other words: to avoid ballast.

Locally generated Received headers can be near-unique (on some systems
they aren't: inode numbers can be recycled, for instance, so they are
only unique at a given point in time), but they are not sent by the
spammer, and they are at least not surprising to the end user. Hence
the locally generated headers at large carry no entropy; they contain
no information.

Theoretically, the entropy H (strictly speaking, the information
content of a single event) is defined as the negative binary logarithm
(logarithmus dualis, ld) of the probability of occurrence:
H(x) := - ld p(x).  In our case, p(local Received) == 1, hence
H(local Received) = 0. We can discard such tokens at will without
sacrificing accuracy. (Makes me wonder whether the entropy should be
used as an additional weighting factor or, alternatively, whether
tokens with high entropy should be preferred when figuring out if a
mail is spam. Some research would be needed on that; I haven't checked
existing publications though.)
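
To make the arithmetic concrete, here is a minimal sketch of the
formula H(x) = - ld p(x) in Python. The function name
self_information is made up for illustration and is not part of
bogofilter; the probabilities are example values, not measurements.

```python
import math

def self_information(p):
    """Return -log2(p) in bits for a probability p with 0 < p <= 1.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    # A certain event (p == 1) carries no information at all.
    return 0.0 if p == 1.0 else -math.log2(p)

# A token that shows up in every message, such as a local Received line:
always_present = self_information(1.0)   # 0 bits

# A token seen in one message out of four (example value):
one_in_four = self_information(0.25)     # 2 bits
```

A token with p = 1 contributes 0 bits, which is the sense in which it
can be dropped without losing anything.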

Back to our question: Received: headers are ordered, with new ones
inserted at the beginning. I am not aware of software that gets this
wrong. There may be software that does not emit a Received: header
though.

All mails that I receive have three locally generated Received: headers:
one from my upstream POP3 server, one from fetchmail, and one from my
local Postfix. Discarding these three lines will not discard any
information that was present at the originator's (spammer's) site.
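
Stripping the first N Received: lines could look roughly like the
sketch below. It is written in Python for brevity and is not how the
bogofilter lexer works; the function name strip_leading_received, the
parameter n, and the sample header lines are all assumptions made for
illustration. It relies on new Received: headers being prepended, and
it treats lines that start with a space or tab as continuations of the
preceding header.

```python
def strip_leading_received(header_lines, n):
    """Drop the first n Received: header fields from a list of header lines.

    Folded continuation lines (starting with space or tab) are dropped
    together with the Received: field they belong to.
    Hypothetical helper for illustration only.
    """
    kept = []
    dropped = 0
    skipping = False
    for line in header_lines:
        if line[:1] in (" ", "\t"):
            # Continuation line: shares the fate of the field it continues.
            if not skipping:
                kept.append(line)
            continue
        # Start of a new header field.
        if dropped < n and line.lower().startswith("received:"):
            dropped += 1
            skipping = True
        else:
            skipping = False
            kept.append(line)
    return kept

# Made-up example: three local hops on top, one remote hop below.
headers = [
    "Received: by local-postfix.example",
    "\tid ABC123",
    "Received: from localhost by fetchmail",
    "Received: from pop3.example",
    "Received: from sender.example",
    " by relay.example",
    "Subject: hello",
]
remaining = strip_leading_received(headers, 3)
```

With n = 3, only the Received: line added outside the local site and
the other headers remain. If a message carries fewer than n Received:
headers, all of them are dropped, so n has to match the local setup.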

-- 
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95



