lexer charsets

Matthias Andree matthias.andree at gmx.de
Sun Nov 3 18:48:02 CET 2002


David Relson <relson at osagesoftware.com> writes:

> Lastly, we have the whole arena of character sets.  The lexer could
> recognize "charset=xyz", identify it as a CHARSET token, and then call a
> charset initialization function to set up the translation table.
>
> I'm planning on writing code to set this up.  As I'm unfamiliar with
> the specifics of the various character sets, e.g. German vs. French
> vs. Greek, etc., I will leave that detailed work to those more
> interested in them and more knowledgeable than I am.

I recommend against this.  We need a MIME parser first (we can steal
one from another project or use an existing RFC 2045 library) before
we can do anything like this; a lexer won't do.  Then we need to
canonicalize the input character set to a common character set that we
use for our persistent token lists (the databases).  A MIME part is
perfectly valid without a character set declaration (the implied
character set is US-ASCII), but you cannot simply reset the character
set at the next boundary-looking line: that line might not match the
corresponding boundary parameter, so only a real MIME parser can tell
where a part ends and a new character set can take effect.
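
To make the ordering concrete, here is a minimal sketch of the
canonicalization step using iconv(3).  This is not bogofilter code:
the function name, the ISO-8859-1 target charset, and the caller
having already MIME-parsed the part and extracted its charset
parameter are all assumptions for illustration.

  #include <stddef.h>
  #include <iconv.h>

  /* Convert "in" (inlen bytes, in the part's declared charset, or
   * US-ASCII when none was declared) into the canonical charset used
   * for the persistent token lists.  Returns the number of bytes
   * written to "out", or (size_t)-1 on failure. */
  static size_t canonicalize(const char *charset,
                             const char *in, size_t inlen,
                             char *out, size_t outlen)
  {
      iconv_t cd;
      char *inp = (char *)in;     /* iconv(3) wants char **       */
      char *outp = out;
      size_t ret;

      if (charset == NULL)
          charset = "US-ASCII";   /* RFC 2045: implied charset    */

      /* ISO-8859-1 as the canonical DB charset is an assumption. */
      cd = iconv_open("ISO-8859-1", charset);
      if (cd == (iconv_t)-1)
          return (size_t)-1;      /* unknown source charset       */

      ret = iconv(cd, &inp, &inlen, &outp, &outlen);
      iconv_close(cd);

      return ret == (size_t)-1 ? (size_t)-1 : (size_t)(outp - out);
  }

(On glibc this lives in libc; other systems may need -liconv.)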

We're not getting anywhere without that.

(Other than that, we can define a break-characters rule in lexer.l, as
we did for BASE64, to get a more concise lexer.l; a sketch follows
below.  I don't believe we gain much speed by running is*()-style
functions -- standard library or our own -- in yyinput rather than
letting lex handle this.)

-- 
Matthias Andree


