lexer_l.l: umlauts/national characters

Matthias Andree matthias.andree at gmx.de
Fri Sep 20 17:55:35 CEST 2002


Hi,

Boris Piwinger reported trouble with 8-bit characters (German umlauts).
The offending line of lexer_l.l appears to be line #278:

[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]    {return(TOKEN);}

This rule does not match national characters. The trivial approach would
be to list the national characters in the rule, but that does not scale:
how would we know whether the Turkish, German or Finnish set is the
right one? I thought about using [:print:] or [:alnum:] and setting a
proper locale, but does anybody know how portable that would be? My flex
manual says nothing about locales. SUSv3 says that input and output are
localized, but that the lex source itself is in the POSIX locale.
Whatever that means.
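
To make the locale question concrete, here is a minimal C sketch of what
the <ctype.h> classification does with 8-bit bytes. It only shows the
behaviour of isalnum() in a C program; whether a flex-generated scanner
would honour the locale for [:alnum:] is exactly the open question above.
The function name high_alnum_in_locale is made up for this illustration.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of bogofilter): switch LC_CTYPE to the
 * named locale and count how many bytes in 0x80..0xFF isalnum() accepts.
 * Returns -1 if the locale is not available on this system.
 */
int high_alnum_in_locale(const char *name)
{
    int count = 0;
    int c;

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;              /* locale not installed */

    for (c = 0x80; c <= 0xFF; c++) {
        if (isalnum(c))         /* c is in unsigned char range, so this is valid */
            count++;
    }
    return count;
}
```

In the "C" (POSIX) locale this count is zero, so a byte such as 0xDC
(U-umlaut in ISO-8859-1) is not a word character there. Passing "" or a
name like "de_DE.ISO-8859-1" may give a different count, but only if that
locale is installed, which is the portability worry.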

It looks as though there is no easy solution to this problem. Anyone got
some decent idea?

Also note that a national character could start a word, as in
"Überraschung" (<German> surprise) or in "Østfold fylke" (<Norwegian>
Eastfold region) or "Ålesund" (Norwegian city).
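
The effect on a word with a leading national character can be
approximated outside the lexer. The sketch below copies the rule into a
POSIX extended regular expression and reports where the first match
starts. This is an assumption-laden stand-in: regcomp()/regexec() are not
flex, and token_start is a name invented for this example, but the
character classes are the same.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the byte offset where the first TOKEN-like
 * match starts in s, or -1 if nothing matches (or the pattern fails to
 * compile). The pattern is the lexer_l.l rule rewritten as a POSIX ERE.
 */
int token_start(const char *s)
{
    regex_t re;
    regmatch_t m;
    int start = -1;

    if (regcomp(&re, "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]",
                REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, s, 1, &m, 0) == 0)
        start = (int)m.rm_so;   /* offset of the leftmost match */

    regfree(&re);
    return start;
}
```

With the ISO-8859-1 input "\xDC" "berraschung" (the literal is split so
that the hex escape does not swallow the following "be"), the match
starts at offset 1, i.e. the leading umlaut is not part of the token.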

-- 
Matthias Andree



More information about the bogofilter-dev mailing list