charset implementation progress
David Relson
relson at osagesoftware.com
Tue Nov 26 18:40:06 CET 2002
Greetings,
I've made significant progress on supporting character sets, a project
I started several weeks ago.
The way I have implemented character sets is to spot "charset=name" in the
message and initialize some character translation maps so that the lexer's
task can be simpler, faster, and more accurate. As the message is read
from stdin, it is saved (for passthrough, if necessary), _then_ modified
according to the charset table, then processed by the lexer. Doing it this
way, what the user sees is totally unchanged (except for adding the
X-Bogosity line) and the lexer sees a set of characters that it can handle.
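The read-save-translate order described above could look roughly like this; the table and function names are my own sketch, not bogofilter's actual code:

```c
#include <stdlib.h>
#include <string.h>

/* 256-entry translation map, filled in by the charset initializers
 * (name is illustrative). */
unsigned char char_map[256];

/* Produce a translated copy of a saved line for the lexer, leaving
 * the original buffer untouched for passthrough. */
unsigned char *translate_for_lexer(const unsigned char *line, size_t len)
{
    unsigned char *work = malloc(len + 1);
    if (work == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        work[i] = char_map[line[i]];   /* apply the charset table */
    work[len] = '\0';
    return work;
}
```

The lexer then consumes the translated copy while the saved original is echoed to stdout when passthrough is enabled.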
I've modified the lexer to recognize "charset=name", give it a distinct
token type, and call a function when it has one.
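A flex rule of roughly this shape could do the recognition and callback; the pattern, the token name, and the header are my sketch, not bogofilter's actual grammar:

```lex
%{
#include "charset.h"   /* hypothetical header declaring got_charset() */
%}
%%
[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=\"?[-A-Za-z0-9_]+\"?   {
        got_charset(yytext);   /* callee strips the "charset=" prefix and quotes */
        return TOKEN_CHARSET;  /* distinct token type, as described */
    }
```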
The function checks the character set name against a list of known charsets
and calls the initialization function for that charset. The initialization
function sets up two translation tables. One is used to map the standard
ASCII token separators (as currently recognized by the lexer) to a
delimiter, i.e. a space character. Doing this with a translation table
allows the lexer rules to be shorter and simpler, thus decreasing the
lexer's size and increasing its speed a bit. The second table is for
upper case to lower case conversion.
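The name-lookup-and-dispatch step might be sketched like this, with stub initializers standing in for the real per-charset setup (all names here are assumptions):

```c
#include <string.h>
#include <strings.h>   /* strcasecmp */

typedef void (*charset_init_fn)(void);

/* Stub initializers standing in for the real table setup. */
static void init_iso_8859_1(void) { /* fill translation tables */ }
static void init_us_ascii(void)   { /* default init plus quote fixups */ }

static const struct {
    const char *name;
    charset_init_fn init;
} charsets[] = {
    { "iso-8859-1", init_iso_8859_1 },
    { "us-ascii",   init_us_ascii   },
};

/* Match the reported name case-insensitively against the known list
 * and run that charset's initializer; returns 1 if recognized. */
int got_charset(const char *name)
{
    size_t n = sizeof charsets / sizeof charsets[0];
    for (size_t i = 0; i < n; i++) {
        if (strcasecmp(name, charsets[i].name) == 0) {
            charsets[i].init();
            return 1;
        }
    }
    return 0;   /* unknown charset: keep the default tables */
}
```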
The default initialization routine, which is called before the charset's
real initializer, just deals with 7-bit ASCII. It maps control characters
and delimiting characters to spaces and maps uppercase to lowercase (A-Z to
a-z). It provides an identity map (no values change) for high-bit characters
(0x80 to 0xFF).
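The default setup of the two tables could be sketched as follows; the table names are illustrative, not bogofilter's own:

```c
#include <ctype.h>

/* The two 256-entry maps described above (names are illustrative). */
unsigned char sep_map[256];    /* token separators -> ' ' */
unsigned char case_map[256];   /* 'A'-'Z' -> 'a'-'z' */

/* Default 7-bit ASCII setup: control and delimiting characters become
 * spaces, uppercase folds to lowercase, and high-bit bytes (0x80-0xFF)
 * map to themselves. */
void init_default_charset(void)
{
    for (int i = 0; i < 256; i++) {
        sep_map[i]  = (unsigned char)i;   /* identity by default */
        case_map[i] = (unsigned char)i;
        if (i < 0x80 && (iscntrl(i) || isspace(i)))
            sep_map[i] = ' ';
        if (i >= 'A' && i <= 'Z')
            case_map[i] = (unsigned char)(i - 'A' + 'a');
    }
}
```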
Most of the charset initialization routines contain the comment "Not yet
implemented" and do nothing besides the default initialization. As I don't
know enough about iso-8859-2 through iso-8859-15 to do the job properly,
I'm leaving that work for those who _do_ know more.
Regarding us-ascii, I have done a bit of work, though not too much. The
iso-8859-1 tables are currently the defaults. The us-ascii routine uses
the default initializations and then maps 0xA0, 0x92, 0x93, 0x94, and 0xA9
to space, single quote, or double quote (as appropriate).
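The us-ascii fixups layered on top of the defaults might look like this; which byte gets which replacement is my reading of the Windows-1252 glyphs these bytes usually carry, so treat the per-byte choices as assumptions:

```c
/* 256-entry map, assumed to start from the default initialization. */
unsigned char char_map[256];

void init_us_ascii(void)
{
    for (int i = 0; i < 256; i++)
        char_map[i] = (unsigned char)i;   /* identity, standing in for the defaults */
    char_map[0xA0] = ' ';    /* non-breaking space -> space */
    char_map[0x92] = '\'';   /* curly apostrophe   -> single quote */
    char_map[0x93] = '"';    /* left curly quote   -> double quote */
    char_map[0x94] = '"';    /* right curly quote  -> double quote */
    char_map[0xA9] = ' ';    /* remaining byte     -> space (assumed) */
}
```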
With these routines in place, the regression test results have changed a
little bit. Since "iso-8859-1", "us-ascii", etc. are now processed by the
got_charset() routine and are not passed on as tokens, some of bogofilter's
tests now have one or two fewer tokens used in the spamicity
calculation. Also, the tests from bogofilter-0.9.0 which caused trouble on
hp-ux now work the same as the current tests - at least on Linux. Allyn
will tell us soon enough if there are portability issues.
As far as I know, there are no other side-effects to these changes. They
provide a framework for fuller charset support without hindering or
breaking what we now have. I'm going to use them privately for the
next couple of days and will add them to CVS after bogofilter-0.9.0.x is
promoted to "stable".
David
Of course, if someone _really_ wants the new code, I can make a patch available.
More information about the bogofilter-dev
mailing list