Encoded tôkens of subjé ct with làrge têxt get splitted fróm MUA

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Mon Aug 4 09:23:27 CEST 2003


Matthias Andree <matthias.andree at gmx.de> wrote:

>> P.S.  My MUA doesn't get it quite right, either.  It has an extra space
>> in its subject.
>
>No wonder. X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
>                             ---------
>
>This "can't do RFC-2047 right" is shared by more than one Windows
>mailer, so I wonder if they all use the same broken "system"^WOE
>library.

And I have seen mail readers under Linux which don't do it.
I have also seen some which don't understand ISO-8859-1, let
alone Unicode. But AFAICT the above version is outdated.

>Back on topic, how can we attack the real problem? One issue is: what do
>we do when adjacent encoded words don't have the same character set?

As long as we don't care about the charset anyway, this is
not a problem. We will need to deal with it when we learn
charsets.
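For what it's worth, RFC 2047 says whitespace between adjacent encoded words is ignored, and each word carries its own charset, so a decoder can handle mixed charsets word by word. A small Python sketch (the subject string is made up for illustration):

```python
from email.header import decode_header

# Two adjacent encoded words with *different* charsets;
# the whitespace between them is ignored by the decoder.
subj = "=?iso-8859-1?Q?t=F4ken?= =?utf-8?B?dGVzdA==?="
parts = decode_header(subj)  # list of (bytes, charset) pairs
decoded = "".join(b.decode(cs) for b, cs in parts)
# decoded is now "tôkentest" -- one token, two charsets
```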

>We've seen with the case blindness that we've discarded useful
>information, folding isn't desired. Folding all character sets into
>UTF-8 discards information, 

I don't believe that, if you take the other side into
account. Not uniformizing introduces pseudoinformation: it
stops the same word from showing up as the same word. This
is similar to removing comments in HTML, which hid words
away before we did it.
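To illustrate the point about folding into one charset (a made-up example, not bogofilter code): the same word stored in two different charsets compares as two different tokens at the byte level, and only becomes one word after both are folded into Unicode:

```python
# "tôken" as raw bytes in two charsets
latin1 = b"t\xf4ken"      # ISO-8859-1
utf8 = b"t\xc3\xb4ken"    # UTF-8

# The raw bytes differ, so a byte-level tokenizer sees two words...
assert latin1 != utf8
# ...but folded into Unicode they are the same word.
assert latin1.decode("iso-8859-1") == utf8.decode("utf-8")
```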

Here is what I think would be the proper way. I don't know
if it can easily be included into bogofilter:

When parsing the header:
1) Read the subject line; if the next line begins with
whitespace, concatenate by removing the \n. Repeat for
additional folded lines.
2) Remove all whitespace between encoded words.
3) Apply the lexer to that.
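The three steps above could be sketched like this in Python (hypothetical helper name `prepare_subject`, not actual bogofilter code, which is C):

```python
import re

def prepare_subject(raw):
    """Prepare a possibly folded Subject value for the lexer,
    following the three steps above."""
    # 1) Unfold: remove the \n in front of each whitespace-led
    #    continuation line (re.sub repeats for every folded line).
    unfolded = re.sub(r"\n(?=[ \t])", "", raw)
    # 2) Remove all whitespace between adjacent encoded words,
    #    so a token split across encoded words is rejoined.
    joined = re.sub(r"(\?=)[ \t]+(=\?)", r"\1\2", unfolded)
    # 3) The lexer would then be applied to this result.
    return joined
```

For example, `prepare_subject("=?x?Q?ab?=\n =?x?Q?cd?=")` yields `"=?x?Q?ab?==?x?Q?cd?="`, one unbroken run for the lexer.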

pi




More information about the Bogofilter mailing list