RFC-2047

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Jul 23 08:33:53 CEST 2003


David Relson <relson at osagesoftware.com> wrote:

>I think what was suggested was to create a token of form 
>"charset:decoded_text".  Thus if we assume encoding 't' means 'text/plain', 
>from atom "=?iso-8859-1?t?junk?=" would come token "iso-8859-1:junk" and 
>from "=?iso-8859-37?t?junk?=" would come token "iso-8859-37:junk".  Having 
>bogofilter decode the text portion is sufficient, I think, and creating the 
>special token is unnecessary.

Right, the above seems to unnecessarily blow up the database
while still failing to treat the same word as the same word. My
suggestion was to add a special token:
body_charset:iso-8859-1
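
Just to make that concrete, here is a minimal C sketch (the
function name emit_charset_token() is mine, not bogofilter's):
it takes an RFC-2047 encoded word, pulls out the charset field,
and prints a single body_charset token; decoding the text part
stays with the normal lexer.

    /* Minimal sketch, not bogofilter code: given an RFC-2047 encoded
     * word such as "=?iso-8859-1?Q?junk?=", extract the charset field
     * and emit one "body_charset:<charset>" token. */
    #include <stdio.h>
    #include <string.h>

    static void emit_charset_token(const char *encoded_word)
    {
        const char *start, *end;
        char charset[64];
        size_t len;

        /* An encoded word has the form "=?charset?encoding?text?=". */
        if (strncmp(encoded_word, "=?", 2) != 0)
            return;
        start = encoded_word + 2;
        end = strchr(start, '?');
        if (end == NULL)
            return;

        len = (size_t)(end - start);
        if (len >= sizeof(charset))
            len = sizeof(charset) - 1;
        memcpy(charset, start, len);
        charset[len] = '\0';

        printf("body_charset:%s\n", charset);
    }

    int main(void)
    {
        emit_charset_token("=?iso-8859-1?Q?junk?=");   /* body_charset:iso-8859-1 */
        emit_charset_token("=?iso-8859-37?Q?junk?=");  /* body_charset:iso-8859-37 */
        return 0;
    }

The decoded text itself would then go through the ordinary
tokenizer, so the same word looks the same no matter how it was
encoded.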

>Bogofilter has the beginnings of charset tables for doing character 
>translations.  Flex has its own definitions of letters and numbers and 
>bogofilter's parsing depends on those definitions.  Assuming it's possible, 
>my idea for charset translation is to translate special characters such as 
>accented vowels and consonants (which flex doesn't handle) to vowels and 
>consonants which flex can reasonably handle.  With proper tables I think it 
>possible for flex to handle all the European languages.

This is a nice idea, but limited to latin-based languages.
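
For illustration only, one rough way such a table could look in C
(a sketch, not bogofilter's actual charset code): a 256-entry byte
map that folds a few ISO-8859-1 accented letters onto their base
letters before the text reaches flex.

    /* Sketch only, not bogofilter's charset tables: fold a handful of
     * ISO-8859-1 accented letters onto plain ASCII so that flex can
     * treat them as ordinary letters. */
    #include <stddef.h>
    #include <stdio.h>

    static unsigned char xlat[256];

    static void init_xlat(void)
    {
        int i;

        for (i = 0; i < 256; i++)
            xlat[i] = (unsigned char)i;    /* identity by default */

        /* A few examples; a real table would cover the full charset. */
        xlat[0xE0] = 'a';  /* a grave     */
        xlat[0xE1] = 'a';  /* a acute     */
        xlat[0xE4] = 'a';  /* a diaeresis */
        xlat[0xE7] = 'c';  /* c cedilla   */
        xlat[0xE8] = 'e';  /* e grave     */
        xlat[0xE9] = 'e';  /* e acute     */
        xlat[0xF6] = 'o';  /* o diaeresis */
        xlat[0xFC] = 'u';  /* u diaeresis */
    }

    static void translate(unsigned char *buf, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++)
            buf[i] = xlat[buf[i]];
    }

    int main(void)
    {
        unsigned char word[] = { 'c', 0xE9, 'l', 0xE8, 'b', 'r', 'e', 0 };

        init_xlat();
        translate(word, 7);
        printf("%s\n", (char *)word);   /* prints "celebre" */
        return 0;
    }

This only helps where an obvious base letter exists, hence the
limitation to Latin-based scripts.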

pi



