lexer charsets
Matthias Andree
matthias.andree at gmx.de
Sun Nov 3 18:48:02 CET 2002
David Relson <relson at osagesoftware.com> writes:
> Lastly, we have the whole arena of character sets. The lexer could
> recognize "charset=xyz", identify it as a CHARSET token, and then call a
> charset initialization function to set up the translation table.
>
> I'm planning on writing code to set this up. As I'm unfamiliar with the
> specifics of the various character sets, e.g. German vs. French vs. Greek,
> etc., I will leave that detailed work to those more interested in them
> and knowledgeable than I am.
I recommend against this. We need a MIME parser first (we can steal one
from another project, or use an existing RFC 2045 library) before
we can do anything like this. A lexer won't do. Then we need to
canonicalize the input character set to a common character set that we
use for our persistent token lists (the DBs). A MIME part is
perfectly valid without a character set declaration (the implied character
set is US-ASCII), but you cannot simply reset the character set at the
next boundary line: the line might not match the corresponding boundary
parameter, i.e. it might just be body text rather than a real part
boundary.
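To make the canonicalization step concrete, here is a minimal sketch of
converting a decoded ISO-8859-1 part to UTF-8 before tokens reach the
word lists. The function name and the choice of UTF-8 as the common
character set are illustrative assumptions, not bogofilter's actual code;
Latin-1 is the easy case because its bytes map 1:1 onto Unicode code
points U+0000..U+00FF.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: canonicalize an ISO-8859-1 (Latin-1) MIME part
 * to UTF-8 so that all persistent tokens share one character set.
 * Each Latin-1 byte becomes one UTF-8 byte (< 0x80) or two (>= 0x80).
 * Returns a malloc'd NUL-terminated string, or NULL on failure. */
char *latin1_to_utf8(const unsigned char *in, size_t len)
{
    /* Worst case: every input byte expands to two output bytes. */
    char *out = malloc(2 * len + 1);
    char *p = out;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            *p++ = (char)c;                   /* ASCII passes through */
        } else {
            *p++ = (char)(0xC0 | (c >> 6));   /* two-byte UTF-8 sequence */
            *p++ = (char)(0x80 | (c & 0x3F));
        }
    }
    *p = '\0';
    return out;
}
```

Other source charsets (ISO-8859-2, KOI8-R, ...) would each need their own
table or an iconv(3) call, which is exactly why this belongs in a MIME
layer that knows the part's declared charset, not in the lexer.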
We're not getting anywhere without that.
(Other than that, we can define a break-characters rule in lexer.l, like
we did for BASE64, to get a more concise lexer.l. I don't believe we're
really much faster when using is*-style functions -- standard library or
our own -- in yyinput rather than letting lex handle this.)
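A break-characters rule of that kind might look roughly like this in
lexer.l (the definition names and the character class are illustrative
assumptions, not bogofilter's actual grammar):

```lex
BREAK   [^A-Za-z0-9._'-]
WORD    [A-Za-z0-9._'-]+

%%

{WORD}     { return TOKEN_WORD;  /* flex's tables classify the bytes */ }
{BREAK}+   { /* discard separator runs; no is*() calls in yyinput */ }

%%
```

The point is that flex compiles the character classes into its DFA
tables, so byte classification happens in the generated scanner rather
than in per-character function calls.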
--
Matthias Andree