Serious problem with non-ASCII words
David Relson
relson at osagesoftware.com
Fri Sep 20 19:00:41 CEST 2002
At 11:28 AM 9/20/02, Boris 'pi' Piwinger wrote:
>Matthias Andree wrote:
>
> > Looks like the parser is broken. Since I know German, I shall have a
> > look.
>
>Thanks.
>
> > BTW, in case someone wonders about quoted-printable decoding, we could
> > go for reformime -r8 for now. reformime is part of the maildrop package,
> > available from http://www.flounder.net/~mrsam/maildrop/README.html
>
>On one hand this is nice. On the other it might be a bad idea to get
>more and more dependencies.
>
>Another problem may be multi-byte character sets such as UTF-8: the same
>word will look different in ISO-8859-x and UTF-8.
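To illustrate the point above, here is a short Python sketch (purely illustrative, not part of bogofilter): the same German word encodes to different byte sequences under ISO-8859-1 and UTF-8, so a byte-oriented lexer would see two distinct tokens.

```python
# The same word, encoded two ways. In ISO-8859-1 the umlaut is one
# byte (0xF6); in UTF-8 it is two bytes (0xC3 0xB6), so the raw byte
# sequences differ and a byte-level tokenizer treats them as different words.
word = "schön"

latin1_bytes = word.encode("iso-8859-1")
utf8_bytes = word.encode("utf-8")

print(latin1_bytes)                 # b'sch\xf6n'      (5 bytes)
print(utf8_bytes)                   # b'sch\xc3\xb6n'  (6 bytes)
print(latin1_bytes == utf8_bytes)   # False
```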
Currently lexer_l.l has two lines for recognizing tokens. They are:
{IPADDR} {return(TOKEN);}
[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$] {return(TOKEN);}
As can be seen, a "word" is a letter or dollar sign, followed by letters,
digits, and a few special characters, and ending with a letter, digit, or
dollar sign. Non-English characters aren't handled at all, and the
underscore is missing as well ...
There's definitely room for a fix!
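To see the problem concretely, here is a rough Python approximation of the second flex rule quoted above (the {IPADDR} rule is omitted; this is only a sketch, not bogofilter code). Words containing umlauts or underscores are dropped or split:

```python
import re

# Python equivalent of the flex pattern
#   [A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]
# The character classes are ASCII-only, so non-English letters and
# the underscore terminate (or entirely prevent) a match.
TOKEN = re.compile(r"[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]")

print(TOKEN.findall("Grüße foo_bar"))   # ['foo', 'bar'] -- "Grüße" is lost,
                                        # "foo_bar" splits at the underscore
print(TOKEN.findall("world's $100"))    # ["world's", '$100']
```

Note that the German word vanishes completely: no three-character run satisfies the ASCII-only classes, so the lexer never even emits a fragment of it.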
For summary digest subscription: bogofilter-digest-subscribe at aotto.com
More information about the Bogofilter mailing list