foreign characters and words
Clint Adams
schizo at debian.org
Sat Sep 21 21:32:41 CEST 2002
This lex line
[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$] {return(TOKEN);}
doesn't handle non-ASCII words very well. For example, the German
einigermaßen gets parsed as the two words "einigerma" and "en". The
latter is then dropped because it is only two characters long, and the
pattern requires at least three.
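To see the failure concretely, here is a rough Python translation of the
lex rule (hand-transcribed character classes, so treat it as a sketch
rather than bogofilter's actual tokenizer):

```python
import re

# Approximation of the lex rule [A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]:
# a token starts with a letter or $, continues with letters, digits,
# $, ', ., or -, and ends with a letter, digit, or $.
TOKEN = re.compile(r"[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]")

# The ß is outside [A-Za-z], so it splits the word; the trailing "en"
# is only two characters and cannot satisfy the three-character minimum.
print(TOKEN.findall("einigermaßen"))
```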
Since bogofilter doesn't pay any attention to charset encodings, and
flex wouldn't care if it did, I suggested this line:
[^[:blank:]\n[:digit:]'.-][^[:blank:]\n]+[^[:blank:]\n[:digit:]'.-] {return(TOKEN);}
It seems to work well for ISO-8859-1 messages, but is potentially less
useful for East Asian languages, where words are strung together without
spacing. Any ideas?
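A rough Python analogue of the negated-class rule shows both behaviors
(here \s stands in for [:blank:] plus \n; it also excludes \r, \f, and
\v, which shouldn't matter for this illustration):

```python
import re

# Approximation of the suggested rule
# [^[:blank:]\n[:digit:]'.-][^[:blank:]\n]+[^[:blank:]\n[:digit:]'.-]:
# any non-blank character may appear inside a token, so Latin-1
# letters like ß are no longer word breaks.
TOKEN = re.compile(r"[^\s\d'.-][^\s]+[^\s\d'.-]")

print(TOKEN.findall("einigermaßen"))    # whole word survives intact
print(TOKEN.findall("日本語のテキスト"))  # unsegmented CJK collapses into one token
```

The second call illustrates the East Asian problem: with no spaces to
break on, an entire run of text becomes a single token.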