Serious problem with non-ASCII words

David Relson relson at osagesoftware.com
Fri Sep 20 19:00:41 CEST 2002


At 11:28 AM 9/20/02, Boris 'pi' Piwinger wrote:
>Matthias Andree wrote:
>
> > Looks like the parser is broken. Since I know German, I shall have a
> > look.
>
>Thanks.
>
> > BTW, in case someone wonders about quoted-printable decoding, we could
> > go for reformime -r8 for now. reformime is part of the maildrop package,
> > available from http://www.flounder.net/~mrsam/maildrop/README.html
>
>On one hand this is nice. On the other it might be a bad idea to get
>more and more dependencies.
>
>Another problem may be multi-byte charsets such as UTF-8: the same
>word will have different byte representations in ISO-8859-x and UTF-8.
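
Boris's point is easy to demonstrate concretely (a quick Python sketch; the word "Müll" is just an illustrative example, not anything from bogofilter's data):

```python
# The same word yields different byte sequences under different
# charsets, so a byte-oriented tokenizer sees two distinct tokens.
word = "Müll"  # German for "garbage"

latin1 = word.encode("iso-8859-1")
utf8 = word.encode("utf-8")

print(latin1)  # b'M\xfcll'      -- 'ü' is the single byte 0xFC
print(utf8)    # b'M\xc3\xbcll'  -- 'ü' is the two bytes 0xC3 0xBC

assert latin1 != utf8
```

So unless the charset is normalized before tokenizing, the "same" German word counts as two unrelated tokens in the wordlists.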

Currently lexer_l.l has two lines for recognizing tokens.  They are:

{IPADDR}                                        {return(TOKEN);}
[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]           {return(TOKEN);}

As can be seen, a "word" is a letter or dollar sign, followed by 
letters, digits, and a few special characters, and ending with a 
letter, digit, or dollar sign.  Non-English characters aren't handled 
at all, and the underscore isn't included either ...
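
The limitation is easy to reproduce with an equivalent regular expression. This is a Python sketch of the second rule above, not the actual flex code, and the widened pattern is only one possible direction for a fix, not bogofilter's actual change:

```python
import re

# Python equivalent of the second lexer_l.l rule quoted above.
TOKEN = re.compile(r"[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]")

# A possible widening (an assumption): admit '_' and the 8-bit
# range \x80-\xff, much as a flex character class could.
WIDER = re.compile(r"[A-Za-z$_\x80-\xff][A-Za-z0-9$'._\x80-\xff-]+[A-Za-z0-9$_\x80-\xff]")

print(TOKEN.fullmatch("free"))     # matches
print(TOKEN.fullmatch("über"))     # None: 'ü' is outside [A-Za-z]
print(TOKEN.fullmatch("foo_bar"))  # None: '_' is not in any class
print(WIDER.fullmatch("über"))     # matches with the widened classes
print(WIDER.fullmatch("foo_bar"))  # matches once '_' is admitted
```

Note that flex itself works on bytes, so a \x80-\xff range would also accept each individual byte of a multi-byte UTF-8 sequence, which ties back to Boris's charset concern above.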

There's definitely room for a fix!



For summary digest subscription: bogofilter-digest-subscribe at aotto.com




More information about the Bogofilter mailing list