Encoded tôkens of s ubjé ct with làrge têxt get splitted f róm MUA

David Relson relson at osagesoftware.com
Mon Aug 4 14:15:37 CEST 2003


At 03:23 AM 8/4/03, Boris 'pi' Piwinger wrote:
>Matthias Andree <matthias.andree at gmx.de> wrote:
>
> >> P.S.  My MUA doesn't get it quite right, either.  It has an extra space
> >> in its subject.
> >
> >No wonder. X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
> >                             ---------
> >
> >This "can't do RFC-2047 right" is shared by more than one Windows
> >mailer, so I wonder if they all use the same broken "system"^WOE
> >library.
>
>And I have seen Readers under Linux which don't do it. I
>also have seen some which don't understand ISO-8859-1, let
>alone Unicode. But AFAICT the above version is outdated.

Indeed, Eudora 4.3.2 is old.  As it works well enough for me and I've heard 
various bad reports about the 5.x releases, I've chosen to stay with the 
version I have.

> >Back on topic, how can we attack the real problem? One issue is: what do
> >we do when adjacent encoded words don't have the same character set?
>
>As long as we don't care about the charset anyhow this is
>not a problem. We need to deal with it when we learn
>charsets.
>
> >We've seen with the case blindness that we've discarded useful
> >information, folding isn't desired. Folding all character sets into
> >UTF-8 discards information,
>
>I don't believe that if you take into account the other
>side. Not uniformizing introduces pseudoinformation, it
>stops showing the same word up as the same word. This is
>similar to removing comments in HTML, which hid away words
>before we did it.
>
>Here is what I think would be the proper way. I don't know
>if it can easily be included in bogofilter:
>
>When parsing the header:
>1) Read the subject line; if the next line begins with whitespace,
>concatenate by removing \n. Repeat for additional folded
>lines.
>2) Remove all whitespace between encoded words.
>3) Apply the lexer to that.

That is a workable plan.  It's somewhat awkward to scan and rescan 
lines.  If it were simpler, the change would have been made :-)
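
For illustration, steps 1 and 2 of the plan quoted above might look roughly 
like the sketch below.  It is written in Python rather than bogofilter's C, 
and the names unfold and join_encoded_words are made up for this example; 
they are not bogofilter functions.  Step 3 (running the lexer over the 
result) is left out.

```python
import re

# One RFC 2047 encoded word: =?charset?encoding?text?=
ENCODED_WORD = r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?="


def unfold(header_lines):
    """Step 1: join continuation lines (those starting with space or tab)."""
    logical = []
    for line in header_lines:
        line = line.rstrip("\r\n")
        if logical and line[:1] in (" ", "\t"):
            # Drop only the line break; the leading whitespace is kept.
            logical[-1] += line
        else:
            logical.append(line)
    return logical


def join_encoded_words(value):
    """Step 2: remove whitespace that separates two adjacent encoded words."""
    pattern = rf"({ENCODED_WORD})[ \t]+(?={ENCODED_WORD})"
    return re.sub(pattern, r"\1", value)


if __name__ == "__main__":
    raw = [
        "Subject: =?iso-8859-1?Q?t=F4kens_of_subj=E9ct?=\n",
        " =?iso-8859-1?Q?_with_l=E0rge_t=EAxt?=\n",
        "X-Mailer: example\n",
    ]
    for header in unfold(raw):
        print(join_encoded_words(header))
```

Whitespace between an encoded word and ordinary text is left alone; only 
the gap between two encoded words is removed, so the lexer would see them 
as one run.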




