lexer charsets

David Relson relson at osagesoftware.com
Sun Nov 3 15:21:26 CET 2002


Greetings,

I think I've found a way to speed up the lexer and make it charset-sensitive 
as well.

Right now, the lexer uses the function myfgets() to read in a line (with some 
processing for nuls and carriage returns).  The lexer then tokenizes the 
input using a complicated pattern that includes :blank:, :punct:, :cntrl:, 
and a string of special characters.  Lastly, get_token() converts upper case 
to lower case.

The above process could be made faster by using a translation table.  Such a 
table could convert every character not allowed in tokens to a space, which 
would simplify the token-matching pattern.  The translation table could 
handle control characters, punctuation, and special characters like 
<>;=!@#%^&*(){}, as well as mapping upper case to lower case.
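
Here's a minimal sketch of the idea, assuming a 256-entry table applied to 
each line right after myfgets().  The names init_xlate_table() and 
xlate_line(), and the particular set of characters kept in tokens, are just 
illustrations, not actual bogofilter code:

#include <ctype.h>
#include <string.h>

static unsigned char xlate[256];

/* Build the table once: token characters keep their (lower-cased)
 * value; everything else -- cntrl, punct, specials -- becomes a space. */
static void init_xlate_table(void)
{
    int c;
    for (c = 0; c < 256; c++) {
        if (isalnum(c))
            xlate[c] = (unsigned char) tolower(c);  /* case folding built in */
        else if (c != 0 && strchr("._-$", c) != NULL)
            xlate[c] = (unsigned char) c;           /* illustrative token characters */
        else
            xlate[c] = ' ';                         /* everything else -> space */
    }
}

/* Translate a line in place after myfgets(); the lexer then only
 * needs to split on runs of spaces. */
static void xlate_line(unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] = xlate[buf[i]];
}

With that in place, the upper-to-lower conversion disappears from 
get_token() because the table has already done it.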

There are several things that are not presently handled.  I have seen some 
tokens with leading 0xA0 characters (spaces with the high bit turned on) 
where a plain space is wanted; a single table entry can easily fix this.  
Unreadable characters, for example Asian characters, are currently passed 
through.  One way to handle them is to convert them to question marks.  An 
option for this can (should?) be added to the configuration file.
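
For illustration, the high-bit handling might look something like this; the 
replace_unreadable flag is purely hypothetical, standing in for whatever 
configuration option we settle on:

#include <stdbool.h>

/* Hypothetical helper: 0xA0 becomes a plain space, and the remaining
 * high-bit bytes are either passed through unchanged or turned into
 * '?' depending on a (hypothetical) configuration option. */
static void xlate_table_8bit(unsigned char xlate[256], bool replace_unreadable)
{
    int c;
    xlate[0xA0] = ' ';                      /* 0xA0: space with the high bit set */
    for (c = 0x80; c < 0x100; c++) {
        if (c == 0xA0)
            continue;
        xlate[c] = replace_unreadable ? '?' : (unsigned char) c;
    }
}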

Lastly, we have the whole arena of character sets.  The lexer could 
recognize "charset=xyz", identify it as a CHARSET token, and then call a 
charset initialization function to set up the translation table.
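
Roughly like this, where got_charset() would be called when the lexer sees 
the CHARSET token.  The table entries and the per-charset init routines are 
placeholders, since the real per-charset details are exactly the part I'm 
leaving to others:

#include <string.h>
#include <strings.h>   /* strcasecmp */

typedef void (*charset_init_fn)(void);

/* Placeholder init routines -- each would rewrite part of the translation table. */
static void init_us_ascii(void)   { }
static void init_iso_8859_1(void) { }
static void init_default(void)    { }   /* unknown charset: pass through or use '?' */

static const struct {
    const char     *name;
    charset_init_fn init;
} charsets[] = {
    { "us-ascii",   init_us_ascii   },
    { "iso-8859-1", init_iso_8859_1 },
};

/* Called by the lexer when it recognizes "charset=xyz". */
void got_charset(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof(charsets) / sizeof(charsets[0]); i++) {
        if (strcasecmp(name, charsets[i].name) == 0) {
            charsets[i].init();
            return;
        }
    }
    init_default();
}

Adding support for a new charset would then be one table entry plus an init 
routine that adjusts the translation table.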

I'm planning on writing code to set this up.  As I'm unfamiliar with the 
specifics of the various character sets, e.g. German vs. French vs. Greek, 
I will leave that detailed work to those who are more interested in and 
knowledgeable about them than I am.

David