linear whitespace [was: RFC-2047 & encoded QP text]

Mon Jul 28 15:30:24 CEST 2003

David Relson wrote:

>>An encoded word does not continue past the end of the line; this must be
>>accounted for. We will also need to take care that we remove linear
>>white space between two encoded words, so that:
>>
>>Summary: =?ISO-8859-1?Q?Regen?=
>>   =?ISO-8859-1?Q?w=FCrmer=3F?=
>>
>>yields {Summary; Regenwürmer} rather than {Summary; Regen; würmer}.

> Dealing with linear whitespace would be nice, but I don't think it's 
> doable.  Spammers can already split their tokens at will and hide them from 
> bogofilter.  There's no way to tell when whitespace should be removed and 
> when it shouldn't.  Consider the following:
> 
> Subject: viagra
> Subject: via gra
> Subject: v i a g r a

Matthias talked about white space between encoded words. I
don't really know how the lexer works, but this algorithm
would be correct:

1) Find the complete header field (i.e., join the folded
lines by removing the \n).

2) Remove all linear white space between encoded words (and
only between encoded words).

3) Decode the encoded words, whether or not they are
correctly separated (i.e., by linear white space).
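
To make the three steps concrete, here is a rough sketch in Python. It is
not bogofilter code and says nothing about how the lexer is built; the
function names and the regular expression are made up for illustration, and
it assumes the charset named in each encoded word is one Python knows.

```python
import base64
import re

# Hypothetical pattern for an RFC 2047 encoded word: =?charset?encoding?text?=
EW = r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?="
ENCODED_WORD = re.compile(EW)
# White space that sits between two encoded words (the second one is only
# looked at, not consumed, so chains of three or more words also work).
BETWEEN = re.compile("(" + EW + r")[ \t]+(?=" + EW + ")")


def unfold(raw):
    """Step 1: join folded lines by removing a line break that is
    followed by white space."""
    return re.sub(r"\r?\n(?=[ \t])", "", raw)


def decode_word(match):
    """Step 3 helper: decode one encoded word to text."""
    charset, encoding, text = match.group(1), match.group(2).upper(), match.group(3)
    if encoding == "B":
        data = base64.b64decode(text)
    else:
        # Q encoding: "_" stands for a space, "=XX" for the byte 0xXX.
        data = re.sub(
            rb"=([0-9A-Fa-f]{2})",
            lambda h: bytes([int(h.group(1), 16)]),
            text.replace("_", " ").encode("ascii"),
        )
    return data.decode(charset, errors="replace")


def decode_header_field(raw):
    field = unfold(raw)                            # step 1
    field = BETWEEN.sub(r"\1", field)              # step 2
    return ENCODED_WORD.sub(decode_word, field)    # step 3


print(decode_header_field(
    "Summary: =?ISO-8859-1?Q?Regen?=\n   =?ISO-8859-1?Q?w=FCrmer=3F?="
))
```

White space in plain text is never touched, so "Subject: via gra" stays as
it is; only the gap between two encoded words is dropped.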

> You and I can recognize them as the same.  Bogofilter parses them as 1 
> word, 2 words, and no words.  Encoding the three lines just changes the 
> appearance, not the content.
> 
> Consider the following:
> 
> Subject: =?ISO-8859-1?Q?linear?= =?ISO-8859-1?Q?whitespace?=
> 
> The current lexer gives "linear whitespace".  Removing spaces would, in 
> this case, be wrong.

No, it would be correct. RFC 2047 (section 6.2) says that
linear white space between adjacent encoded words must be
ignored.

pi