charset implementation progress
David Relson
relson at osagesoftware.com
Tue Nov 26 18:40:06 CET 2002
Greetings,
I've made significant progress on supporting character sets, a project
I started several weeks ago.
The way I have implemented character sets is to spot "charset=name" in the
message and initialize some character translation maps so that the lexer's
task can be simpler, faster, and more accurate. As the message is read
from stdin, it is saved (for passthrough, if necessary), _then_ modified
according to the charset table, then processed by the lexer. Doing it this
way, what the user sees is totally unchanged (except for adding the
X-Bogosity line) and the lexer sees a set of characters that it can handle.
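The read-save-translate order described above could look roughly like this; the table and function names are my own sketch, not bogofilter's actual code:

```c
#include <stdlib.h>
#include <string.h>

/* 256-entry translation map, filled in by the charset initializers
 * (name is illustrative). */
unsigned char char_map[256];

/* Produce a translated copy of a saved line for the lexer, leaving
 * the original buffer untouched for passthrough. */
unsigned char *translate_for_lexer(const unsigned char *line, size_t len)
{
    unsigned char *work = malloc(len + 1);
    if (work == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        work[i] = char_map[line[i]];   /* apply the charset table */
    work[len] = '\0';
    return work;
}
```

The lexer then consumes the translated copy while the saved original is echoed to stdout when passthrough is enabled.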
I've modified the lexer to recognize "charset=name", give it a distinct
token type, and call a function when it has one.
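A flex rule of roughly this shape could do the recognition and callback; the pattern, the token name, and the header are my sketch, not bogofilter's actual grammar:

```lex
%{
#include "charset.h"   /* hypothetical header declaring got_charset() */
%}
%%
[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=\"?[-A-Za-z0-9_]+\"?   {
        got_charset(yytext);   /* callee strips the "charset=" prefix and quotes */
        return TOKEN_CHARSET;  /* distinct token type, as described */
    }
```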
The function checks the character set name against a list of known charsets
and calls the initialization function for that charset. The initialization
function sets up two translation tables. One is used to map the standard
ASCII token separators (as currently recognized by the lexer) to a
delimiter, i.e. a space character. Doing this with a translation table
allows the lexer rules to be shorter and simpler, thus decreasing the
lexer's size and increasing its speed a bit. The second table is for
upper case to lower case conversion.
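The name-lookup-and-dispatch step might be sketched like this, with stub initializers standing in for the real per-charset setup (all names here are assumptions):

```c
#include <string.h>
#include <strings.h>   /* strcasecmp */

typedef void (*charset_init_fn)(void);

/* Stub initializers standing in for the real table setup. */
static void init_iso_8859_1(void) { /* fill translation tables */ }
static void init_us_ascii(void)   { /* default init plus quote fixups */ }

static const struct {
    const char *name;
    charset_init_fn init;
} charsets[] = {
    { "iso-8859-1", init_iso_8859_1 },
    { "us-ascii",   init_us_ascii   },
};

/* Match the reported name case-insensitively against the known list
 * and run that charset's initializer; returns 1 if recognized. */
int got_charset(const char *name)
{
    size_t n = sizeof charsets / sizeof charsets[0];
    for (size_t i = 0; i < n; i++) {
        if (strcasecmp(name, charsets[i].name) == 0) {
            charsets[i].init();
            return 1;
        }
    }
    return 0;   /* unknown charset: keep the default tables */
}
```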
The default initialization routine, which is called before the charset's
real initializer, just deals with 7-bit ASCII. It maps control characters
and delimiting characters to spaces and maps uppercase to lowercase (A-Z to
a-z). It provides an identity map (no values change) for high-bit characters
(0x80 to 0xFF).
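The default setup of the two tables could be sketched as follows; the table names are illustrative, not bogofilter's own:

```c
#include <ctype.h>

/* The two 256-entry maps described above (names are illustrative). */
unsigned char sep_map[256];    /* token separators -> ' ' */
unsigned char case_map[256];   /* 'A'-'Z' -> 'a'-'z' */

/* Default 7-bit ASCII setup: control and delimiting characters become
 * spaces, uppercase folds to lowercase, and high-bit bytes (0x80-0xFF)
 * map to themselves. */
void init_default_charset(void)
{
    for (int i = 0; i < 256; i++) {
        sep_map[i]  = (unsigned char)i;   /* identity by default */
        case_map[i] = (unsigned char)i;
        if (i < 0x80 && (iscntrl(i) || isspace(i)))
            sep_map[i] = ' ';
        if (i >= 'A' && i <= 'Z')
            case_map[i] = (unsigned char)(i - 'A' + 'a');
    }
}
```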
Most of the charset initialization routines contain the comment "Not yet
implemented" and do nothing besides the default initialization. As I don't
know enough about iso-8859-2 through iso-8859-15 to do the job properly,
I'm leaving that work for those who _do_ know more.
Regarding us-ascii, I have done a bit of work, though not too much. The
iso-8859-1 tables are currently the defaults. The us-ascii routine uses
the default initializations and then maps 0xA0, 0x92, 0x93, 0x94, and 0xA9
to space, single quote, or double quote (as appropriate).
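The us-ascii fixups layered on top of the defaults might look like this; which byte gets which replacement is my reading of the Windows-1252 glyphs these bytes usually carry, so treat the per-byte choices as assumptions:

```c
/* 256-entry map, assumed to start from the default initialization. */
unsigned char char_map[256];

void init_us_ascii(void)
{
    for (int i = 0; i < 256; i++)
        char_map[i] = (unsigned char)i;   /* identity, standing in for the defaults */
    char_map[0xA0] = ' ';    /* non-breaking space -> space */
    char_map[0x92] = '\'';   /* curly apostrophe   -> single quote */
    char_map[0x93] = '"';    /* left curly quote   -> double quote */
    char_map[0x94] = '"';    /* right curly quote  -> double quote */
    char_map[0xA9] = ' ';    /* remaining byte     -> space (assumed) */
}
```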
With these routines in place, the regression test results have changed a
little bit. Since "iso-8859-1", "us-ascii", etc. are now processed by the
got_charset() routine and are not passed on as tokens, some of bogofilter's
tests now have one or two fewer tokens used in the spamicity
calculation. Also, the tests from bogofilter-0.9.0 which caused trouble on
hp-ux now work the same as the current tests - at least on Linux. Allyn
will tell us soon enough if there are portability issues.
As far as I know, there are no other side-effects to these changes. They
provide a framework for fuller charset support without hindering or
breaking what we now have. I'm going to use them privately for the
next couple of days and will add them to CVS after bogofilter-0.9.0.x is
promoted to "stable".
David
Of course, if someone _really_ wants the new code, I can make a patch available.
More information about the bogofilter-dev
mailing list