bogofilter-SA-2004-01

Fri Nov 5 01:05:45 CET 2004

On Thu, 04 Nov 2004, .rp wrote:

> I apologize in advance for asking, but why not 
> 	set a flag when an encoded word is encountered 
> 	if LF comes up replace it with ~ 
> 	when encoded word ends set flag off.

No need to apologize for good questions.

An encoded word as per RFC-2047 does not contain line feed characters,
so we should neither accept nor attempt to decode strings that do.

In fact, if we decoded a nonconformant string that merely resembles an
encoded word, we'd discard the information that the message contained a
broken RFC-2047 encoded word. The decoded word no longer carries that
information.

Example:

An intact RFC-2047 encoded word such as

    Test-Header: =?iso-8859-1?q?n=E4h_b=e4h?=

yields

    get_token: 1 "head:Test-Header"
    get_token: 1 "head:näh"
    get_token: 1 "head:bäh"

A broken encoded word with an embedded blank (an embedded LF behaves the
same), such as

    Test-Header: =?iso-8859-1?q?n=E4h b=e4h?=

yields:

    get_token: 1 "head:Test-Header"
    get_token: 1 "head:E4h"
    get_token: 1 "head:e4h"

So you see the difference: the head:E4h and head:e4h tokens might be
registered as spam and become a rather strong indicator in the future,
as these strings aren't commonly seen in headers. If we tolerated the
blank, we'd get the same tokens as in the intact case, and the header
would lose entropy and hence significance for the process of telling
spam from ham.
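
To make the point concrete, here is a minimal Python sketch of the idea.
It is not bogofilter's lexer: the names ENCODED_WORD, decode_q and
header_tokens are made up for illustration, and the tokeniser (plain
runs of word characters) is far cruder than bogofilter's real token
rules, so it emits extra tokens such as head:iso. It only shows that a
strict check decodes the conformant word and leaves the broken one
undecoded, so that its raw fragments survive as tokens.

```python
import base64
import re

# Strict RFC 2047 shape: =?charset?encoding?encoded-text?=
# No whitespace (SPACE, TAB, CR, LF) and no '?' inside any part.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]+)\?=")


def decode_q(text: str, charset: str) -> str:
    """Decode the "Q" encoding: '_' means space, '=XX' is a hex byte."""
    raw = re.sub(
        rb"=([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        text.replace("_", " ").encode("ascii"),
    )
    return raw.decode(charset)


def header_tokens(value: str) -> list[str]:
    """Tokenise a header value, decoding it only if it is conformant."""
    m = ENCODED_WORD.fullmatch(value.strip())
    if m:
        charset, encoding, text = m.groups()
        if encoding in "qQ":
            value = decode_q(text, charset)
        else:
            value = base64.b64decode(text).decode(charset)
    # Hypothetical, simplistic tokeniser: runs of word characters.
    return ["head:" + t for t in re.findall(r"\w+", value)]


intact = header_tokens("=?iso-8859-1?q?n=E4h_b=e4h?=")
broken = header_tokens("=?iso-8859-1?q?n=E4h b=e4h?=")
```

With the strict pattern, the intact header decodes to head:näh and
head:bäh, while the broken one is tokenised as is and keeps head:E4h and
head:e4h among its tokens. Loosening the pattern to allow whitespace
would make both headers produce the same tokens, which is exactly the
loss described above.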

HTH,

-- 
Matthias Andree