foreign characters and words

Clint Adams schizo at debian.org
Sat Sep 21 21:32:41 CEST 2002


This lex line
[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]           {return(TOKEN);}

doesn't handle non-ASCII words very well.  For example, the German
einigermaßen gets parsed as the two words "einigerma" and "en": the ß
byte falls outside [A-Za-z0-9$'.-], so it terminates the first match.
The latter fragment is then dropped because it is only two characters
long.
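
Here is a minimal standalone flex harness to reproduce the split; the
file name orig.l, the printed output, and the catch-all rule are only
for illustration and are not bogofilter's actual lexer:

%option noyywrap
%%
[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]   { printf("token: %s\n", yytext); }
.|\n                                    { /* discard unmatched characters one at a time */ }
%%
/* Reads stdin and prints every token the rule above matches. */
int main(void) { yylex(); return 0; }

Built with "flex orig.l && cc lex.yy.c -o orig", feeding it an
ISO-8859-1 line containing einigermaßen prints only "einigerma"; in
this stripped-down harness the trailing "en" never matches at all,
since the pattern requires at least three characters.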

Since bogofilter doesn't pay any attention to charset encodings, and
flex wouldn't care if it did, I suggested this line:

[^[:blank:]\n[:digit:]'.-][^[:blank:]\n]+[^[:blank:]\n[:digit:]'.-] {return(TOKEN);}
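
Dropping that rule into the same sort of throwaway harness (again, the
printed output and the catch-all are just for testing, not the real
lexer) keeps einigermaßen in one piece:

%option noyywrap
%%
[^[:blank:]\n[:digit:]'.-][^[:blank:]\n]+[^[:blank:]\n[:digit:]'.-] { printf("token: %s\n", yytext); }
.|\n                                                                { /* discard unmatched characters one at a time */ }
%%
/* einigermaßen now prints as a single token. */
int main(void) { yylex(); return 0; }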


It seems to work well for ISO-8859-1 messages, but it is probably less
useful for East Asian languages, where words are strung together
without spacing and an entire run would therefore be lexed as a single
token.  Any ideas?


