linear whitespace [was: RFC-2047 & encoded QP text]

Mon Jul 28 15:20:44 CEST 2003

At 03:40 AM 7/28/03, Matthias Andree wrote:

>An encoded word does not continue past the end of the line, this must be
>accounted for. We will also need to take care that we remove linear
>white space between two encoded words, so that:
>
>Summary: =?ISO-8859-1?Q?Regen?=
>   =?ISO-8859-1?Q?w=FCrmer=3F?=
>
>yields {Summary; Regenwürmer} rather than {Summary; Regen; würmer}.
>
>This is necessary so spammers can't split up their tokens at will to
>hide them from bogofilter's view.
>
>I think we'll have to move the RFC-2047 decoding out of the lexer.

Matthias,

Dealing with linear whitespace would be nice, but I don't think it's 
doable.  Spammers can already split their tokens at will and hide them from 
bogofilter.  There's no way to tell when whitespace should be removed and 
when it shouldn't.  Consider the following:

Subject: viagra
Subject: via gra
Subject: v i a g r a

You and I can recognize them as the same word.  Bogofilter parses them as 
one word, two words, and no words, respectively.  Encoding the three lines 
changes only their appearance, not their content.
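
To make the three cases concrete, here is a rough sketch of the token 
counts.  It is not bogofilter's lexer: the tokens() helper and the 
3-character minimum token length are assumptions made for illustration.

```python
import re

# Assumed for illustration: tokens shorter than this are discarded.
MIN_TOKEN_LEN = 3

def tokens(subject_value):
    """Split a Subject value into alphabetic runs and drop the short ones."""
    words = re.findall(r"[A-Za-z]+", subject_value)
    return [w for w in words if len(w) >= MIN_TOKEN_LEN]

for value in ("viagra", "via gra", "v i a g r a"):
    print(repr(value), "->", tokens(value))
```

Under those assumptions the three subjects give one token, two tokens, and 
none, and nothing in the input says which spaces were inserted to hide a 
word.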

Consider also the following:

Subject: =?ISO-8859-1?Q?linear?= =?ISO-8859-1?Q?whitespace?=

The current lexer gives "linear whitespace", i.e. two tokens.  Removing 
the space between the encoded words would give "linearwhitespace", which 
in this case would be wrong.
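
The two behaviours under discussion can be compared with a small sketch.  
This is a hand-rolled decoder for Q-encoded words only, not bogofilter's 
code; decode_q(), decode_header_value() and the join_adjacent flag are 
hypothetical names used only for this example.

```python
import re

# One RFC 2047 encoded word using the Q encoding: =?charset?Q?text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?[Qq]\?([^?\s]*)\?=")

def decode_q(charset, text):
    """Decode the text part of a Q-encoded word."""
    # In the Q encoding "_" stands for a space and "=XX" is a hex byte.
    data = text.replace("_", " ").encode("ascii")
    data = re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)
    return data.decode(charset)

def decode_header_value(value, join_adjacent):
    """Decode every encoded word in value.

    With join_adjacent=True, whitespace (including a folded line break)
    that sits between two encoded words is removed before decoding.
    """
    if join_adjacent:
        value = re.sub(r"\?=\s+=\?", "?==?", value)
    return ENCODED_WORD.sub(lambda m: decode_q(m.group(1), m.group(2)), value)

subject = "=?ISO-8859-1?Q?linear?= =?ISO-8859-1?Q?whitespace?="
print(decode_header_value(subject, join_adjacent=False))  # linear whitespace
print(decode_header_value(subject, join_adjacent=True))   # linearwhitespace

summary = "=?ISO-8859-1?Q?Regen?=\n   =?ISO-8859-1?Q?w=FCrmer=3F?="
print(decode_header_value(summary, join_adjacent=True))   # Regenwürmer?
```

The same flag that joins the Summary example into a single word also 
joins "linear" and "whitespace", which is the trade-off described above.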

AFAICT embedded whitespace is a limitation that we must accept because we 
can't do anything about it.

David