RFC-2047
David Relson
relson at osagesoftware.com
Wed Jul 23 03:35:02 CEST 2003
At 09:13 PM 7/22/03, Matthias Andree wrote:
>Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> writes:
>
> > But I don't see why the same word should show up several
> > times because of different codings.
>
>- Spam in different character sets, including falsely declared
> ones. German-language spam comes undeclared, as ASCII, ISO-8859-1,
> -15, Windows-1252. The same character sets are available for English,
> Spanish and French.
>
> > Further, we already discussed that we cannot even tell what is
> > whitespace or punctuation if we don't understand the charset.
>
>True, but without such a developer or at least tester feedback, this
>isn't going to change. I'm not adding code that I cannot test and that I
>cannot have tested by somebody.
Matthias,
If bogofilter is to do more with charsets, we definitely need someone for
testing. Being fluent only in English, I'm not the right person for this task.
I think what was suggested was to create a token of the form
"charset:decoded_text". Thus, if we assume encoding 't' means 'text/plain',
the atom "=?iso-8859-1?t?junk?=" would yield the token "iso-8859-1:junk" and
"=?iso-8859-37?t?junk?=" would yield the token "iso-8859-37:junk". Having
bogofilter decode the text portion is sufficient, I think, and creating the
special token is unnecessary.
What has been implemented is to simply decode the text portion (with the
charset being ignored).
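A minimal sketch of the suggested "charset:decoded_text" token scheme, using Python's stdlib RFC 2047 decoder. Note that RFC 2047 itself only defines the 'b' and 'q' encodings, so 'q' stands in here for the hypothetical 't' encoding used in the example above; the function name `charset_token` is likewise just illustrative, not anything in bogofilter.

```python
# Sketch: turn an RFC 2047 encoded word into a "charset:text" token,
# as proposed in the discussion. Not bogofilter's implementation.
from email.header import decode_header

def charset_token(atom):
    """Decode one encoded word and prefix it with its charset."""
    text, charset = decode_header(atom)[0]
    if isinstance(text, bytes):
        try:
            text = text.decode(charset or "ascii")
        except (LookupError, UnicodeDecodeError):
            # Unknown charsets (e.g. the made-up iso-8859-37) fall
            # back to latin-1, which accepts any byte sequence.
            text = text.decode("latin-1")
    return f"{charset}:{text}" if charset else text

print(charset_token("=?iso-8859-1?q?junk?="))   # iso-8859-1:junk
print(charset_token("=?iso-8859-37?q?junk?="))  # iso-8859-37:junk
```

Plain (unencoded) atoms pass through unchanged, which matches the behavior actually implemented: the text is decoded and the charset is otherwise ignored.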
Bogofilter has the beginnings of charset tables for doing character
translations. Flex has its own definitions of letters and numbers, and
bogofilter's parsing depends on those definitions. Assuming it's possible,
my idea for charset translation is to translate special characters, such as
accented vowels and consonants (which flex doesn't handle), to plain vowels
and consonants which flex can reasonably handle. With proper tables I think
it's possible for flex to handle all the European languages.
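To illustrate the idea, here is a hedged sketch of that accent-folding step. It uses Unicode decomposition to strip combining marks rather than the static per-charset lookup tables bogofilter would actually use, and the name `fold_accents` is my own; but the effect is the same: accented letters collapse to the plain ASCII letters flex already recognizes.

```python
# Sketch: fold accented European letters to plain ASCII so a
# flex-based tokenizer can treat them as ordinary letters.
# A real implementation would use per-charset translation tables.
import unicodedata

def fold_accents(text, charset="iso-8859-1"):
    """Decode bytes from charset, then drop combining marks."""
    if isinstance(text, bytes):
        text = text.decode(charset)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(fold_accents("Käse, niño, café"))  # Kase, nino, cafe
```

This handles the accented vowels and consonants of most Western European languages; precomposed letters with no decomposition (such as ß) would still need explicit table entries.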
David