RFC-2047
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Jul 23 08:33:53 CEST 2003
David Relson <relson at osagesoftware.com> wrote:
>I think what was suggested was to create a token of form
>"charset:decoded_text". Thus if we assume encoding 't' means 'text/plain',
>from atom "=?iso-8859-1?t?junk?=" would come token "iso-8859-1:junk" and
>from "=?iso-8859-37?t?junk?=" would come token "iso-8859-37:junk". Having
>bogofilter decode the text portion is sufficient, I think, and creating the
>special token is unnecessary.
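The "charset:decoded_text" idea quoted above can be sketched quickly in Python with the standard library's RFC 2047 decoder. The token format and the helper name are just illustrations of the proposal, not bogofilter's actual API; and since RFC 2047 only defines the 'B' and 'Q' encodings, a Q-encoded example stands in for the hypothetical 't' encoding used in the quoted message:

```python
from email.header import decode_header

def charset_tokens(raw):
    """Decode RFC 2047 encoded-words, emitting 'charset:text' tokens.

    Sketch of the proposal quoted above; not bogofilter's real code.
    """
    tokens = []
    for text, charset in decode_header(raw):
        if charset is None:
            # Unencoded fragment; may be bytes or str depending on input.
            if isinstance(text, bytes):
                text = text.decode("ascii", "replace")
            tokens.append(text.strip())
        else:
            try:
                decoded = text.decode(charset, "replace")
            except LookupError:
                # Unknown charset (e.g. "iso-8859-37"): keep the token
                # anyway by falling back to a byte-preserving decode.
                decoded = text.decode("latin-1", "replace")
            tokens.append("%s:%s" % (charset, decoded))
    return [t for t in tokens if t]

print(charset_tokens("=?iso-8859-1?Q?junk?="))   # -> ['iso-8859-1:junk']
print(charset_tokens("=?iso-8859-37?Q?junk?="))  # -> ['iso-8859-37:junk']
```

Note that both examples decode to the same text "junk" but produce distinct tokens, which is exactly the database blow-up objection raised below.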
Right, the above seems to unnecessarily blow up the database
while not seeing the same word as the same word. My
suggestion was to add a special token:
body_charset:iso-8859-1
>Bogofilter has the beginnings of charset tables for doing character
>translations. Flex has its own definitions of letters and numbers and
>bogofilter's parsing depends on those definitions. Assuming it's possible,
>my idea for charset translation is to translate special characters such as
>accented vowels and consonants (which flex doesn't handle) to vowels and
>consonants which flex can reasonably handle. With proper tables I think it
>possible for flex to handle all the European languages.
This is a nice idea, but limited to latin-based languages.
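The folding idea quoted above (accented letters mapped to base letters before the flex lexer sees them) can be sketched as a simple translation table. The table below is illustrative only, covering a handful of Latin-1 letters; it is not bogofilter's actual charset table:

```python
# Map some Latin-1 accented letters to their base ASCII letters,
# so a flex-based tokenizer that only knows [a-z] still matches them.
FOLD = str.maketrans(
    "àáâãäåçèéêëìíîïñòóôõöùúûüý",
    "aaaaaaceeeeiiiinooooouuuuy",
)

def fold_latin1(s):
    """Lowercase, then replace accented letters with ASCII equivalents."""
    return s.lower().translate(FOLD)

print(fold_latin1("Pâté"))  # -> "pate"
```

As noted, this approach only works for Latin-based scripts, where each accented letter has an obvious single-letter ASCII base.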
pi