lexer charsets
David Relson
relson at osagesoftware.com
Sun Nov 3 15:21:26 CET 2002
Greetings,
I think I've found a way to speed up the lexer and make it charset-sensitive
as well.
Right now, the lexer uses the function myfgets() to read in a line (with some
processing for NULs and carriage returns). The lexer then tokenizes the
input using a complicated pattern that includes :blank:, :punct:, :cntrl:,
and a string of special characters. Lastly, get_token() does an upper-case
to lower-case conversion.
The above process could be made faster by using a translation table. The
table could convert every character not allowed in tokens into a space,
which would then simplify the token-matching pattern. The same table could
handle control characters, punctuation, and special characters such as
<>;=!@#%^&*(){}, as well as mapping upper case to lower case.
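A minimal sketch of what I have in mind (names like xlate_init and
xlate_line are illustrative, not bogofilter's actual code):

```c
#include <ctype.h>
#include <stddef.h>

static unsigned char xlate[256];

/* Build the table once: upper case folds to lower case; control
 * characters, blanks, and punctuation (which covers <>;=!@#%^&*(){})
 * become spaces; everything else passes through unchanged. */
static void xlate_init(void)
{
    int c;
    for (c = 0; c < 256; c++) {
        if (isupper(c))
            xlate[c] = (unsigned char)tolower(c);
        else if (iscntrl(c) || isblank(c) || ispunct(c))
            xlate[c] = ' ';
        else
            xlate[c] = (unsigned char)c;
    }
}

/* Translate a line in place before it reaches the tokenizer,
 * one table lookup per character. */
static void xlate_line(unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] = xlate[buf[i]];
}
```

With the line pre-translated, the flex pattern only needs to split on
spaces instead of enumerating character classes.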
There are several things that are not presently handled. I have seen some
tokens with leading 0xA0 characters (spaces with the high bit turned on)
where a plain space is wanted; a single table entry can easily fix
this. Unreadable characters, for example Asian characters, are currently
passed through; one way to handle them is to convert them to question
marks. An option for this can (should?) be added to the configuration file.
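Both cases are just more table entries. A self-contained sketch, with a
hypothetical flag standing in for the proposed configuration option:

```c
static unsigned char xlate[256];

/* Fill in the high-bit half of the table: 0xA0 (a space with the high
 * bit set) always becomes a plain space; the other high-bit characters,
 * unreadable without charset knowledge, become '?' when the
 * (hypothetical) option is enabled, and pass through otherwise. */
static void xlate_init_highbit(int unreadable_to_qmark)
{
    int c;
    for (c = 0x00; c < 0x80; c++)
        xlate[c] = (unsigned char)c;  /* ASCII untouched in this sketch */
    for (c = 0x80; c < 0x100; c++)
        xlate[c] = unreadable_to_qmark ? '?' : (unsigned char)c;
    xlate[0xA0] = ' ';                /* high-bit space -> space */
}
```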
Lastly, we have the whole arena of character sets. The lexer could
recognize "charset=xyz", identify it as a CHARSET token, and then call a
charset initialization function to set up the translation table.
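The initialization hook might look something like this (a sketch under
assumed names; which entries each charset needs is exactly the detailed
work I'm leaving to others):

```c
#include <strings.h>

static unsigned char xlate[256];

/* Hypothetical hook: called when the lexer recognizes "charset=xyz",
 * rebuilding the translation table for that character set. */
static void charset_init(const char *name)
{
    int c;
    for (c = 0; c < 256; c++)       /* start from the identity mapping */
        xlate[c] = (unsigned char)c;
    if (strcasecmp(name, "iso-8859-1") == 0) {
        xlate[0xA0] = ' ';          /* non-breaking space -> space */
        /* per-charset case folding, accented letters, etc. go here */
    }
    /* further charsets: add a branch per set as entries are supplied */
}
```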
I'm planning on writing code to set this up. As I'm unfamiliar with the
specifics of the various character sets, e.g. German vs. French vs. Greek,
I will leave that detailed work to those more interested in, and more
knowledgeable about, them than I am.
David
More information about the bogofilter-dev mailing list