RFC-2047
David Relson
relson at osagesoftware.com
Wed Jul 23 03:35:02 CEST 2003
At 09:13 PM 7/22/03, Matthias Andree wrote:
>Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> writes:
>
> > But I don't see why the same word should show up several
> > times because of different codings.
>
>- Spam in different character sets, including falsely declared
> ones. German-language spam comes undeclared, as ASCII, ISO-8859-1,
> -15, Windows-1252. The same character sets are available for English,
> Spanish and French.
>
> > Further, we already discussed that we cannot even tell what is
> > whitespace or punctuation if we don't understand the charset.
>
>True, but without such a developer or at least tester feedback, this
>isn't going to change. I'm not adding code that I cannot test and that I
>cannot have tested by somebody.
Matthias,
If bogofilter is to do more with charsets, we definitely need someone for
testing. Being fluent only in English, I'm not the right person for this task.
I think what was suggested was to create a token of the form
"charset:decoded_text". Thus, if we assume encoding 't' means 'text/plain',
the atom "=?iso-8859-1?t?junk?=" would yield the token "iso-8859-1:junk" and
"=?iso-8859-37?t?junk?=" would yield the token "iso-8859-37:junk". Having
bogofilter decode the text portion is sufficient, I think, and creating the
special token is unnecessary.
What has been implemented is to simply decode the text portion (with the
charset being ignored).
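A minimal sketch of the suggested "charset:decoded_text" token scheme, using Python's stdlib RFC 2047 decoder. Note that RFC 2047 itself only defines the 'b' and 'q' encodings, so 'q' stands in here for the hypothetical 't' encoding used in the example above; the function name `charset_token` is likewise just illustrative, not anything in bogofilter.

```python
# Sketch: turn an RFC 2047 encoded word into a "charset:text" token,
# as proposed in the discussion. Not bogofilter's implementation.
from email.header import decode_header

def charset_token(atom):
    """Decode one encoded word and prefix it with its charset."""
    text, charset = decode_header(atom)[0]
    if isinstance(text, bytes):
        try:
            text = text.decode(charset or "ascii")
        except (LookupError, UnicodeDecodeError):
            # Unknown charsets (e.g. the made-up iso-8859-37) fall
            # back to latin-1, which accepts any byte sequence.
            text = text.decode("latin-1")
    return f"{charset}:{text}" if charset else text

print(charset_token("=?iso-8859-1?q?junk?="))   # iso-8859-1:junk
print(charset_token("=?iso-8859-37?q?junk?="))  # iso-8859-37:junk
```

Plain (unencoded) atoms pass through unchanged, which matches the behavior actually implemented: the text is decoded and the charset is otherwise ignored.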
Bogofilter has the beginnings of charset tables for doing character
translations. Flex has its own definitions of letters and numbers, and
bogofilter's parsing depends on those definitions. Assuming it's possible,
my idea for charset translation is to translate special characters, such as
accented vowels and consonants (which flex doesn't handle), to plain vowels
and consonants which flex can reasonably handle. With proper tables I think
it's possible for flex to handle all the European languages.
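To illustrate the idea, here is a hedged sketch of that accent-folding step. It uses Unicode decomposition to strip combining marks rather than the static per-charset lookup tables bogofilter would actually use, and the name `fold_accents` is my own; but the effect is the same: accented letters collapse to the plain ASCII letters flex already recognizes.

```python
# Sketch: fold accented European letters to plain ASCII so a
# flex-based tokenizer can treat them as ordinary letters.
# A real implementation would use per-charset translation tables.
import unicodedata

def fold_accents(text, charset="iso-8859-1"):
    """Decode bytes from charset, then drop combining marks."""
    if isinstance(text, bytes):
        text = text.decode(charset)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(fold_accents("Käse, niño, café"))  # Kase, nino, cafe
```

This handles the accented vowels and consonants of most Western European languages; precomposed letters with no decomposition (such as ß) would still need explicit table entries.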
David