Solutions for the charset issue

Wed Sep 25 10:16:17 CEST 2002

Hi!

We discussed the issue of non-ASCII words a few days ago.

I have now had a closer look at some charsets; you can find the tables at:
http://www.kostis.net/charsets/

Clearly, we will fail badly for all charsets which are not ASCII
compatible, i.e., which don't agree with ASCII on the printable characters
below 128 (all numbers in this mail are decimal). Such charsets
include the ISO-646 family, EBCDIC and others. As it stands now, we
don't look at charsets at all, so there is nothing we can do about it now.

UTF-8 will also cause a headache.

So let's stick with the following charsets for now:
ISO-8859-x, cp125x

Let me first look at ISO-8859-1. My suggestion would be to define a
word not by the set of characters it may consist of, but by the opposite,
i.e., by the characters which are word boundaries. Clearly, line endings
come into play. I'd suggest the following characters:
<=44, 46, 47, 58-64, 91-94, 123-161, 166, 171-173, 187, 191, 247

I have no strong opinion on 45 (hyphen) and 95 (low line).
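
A minimal sketch of this boundary-based word definition, in Python for
brevity. The names build_delimiters and split_words are hypothetical and
are not bogofilter code; the byte values are copied from the list above,
and 45 and 95 are left as options since they are undecided.

```python
def build_delimiters(include_hyphen=False, include_low_line=False):
    """Return the proposed ISO-8859-1 word-boundary bytes as a frozenset."""
    delims = set(range(0, 45))             # <=44 (controls, space, !"#$%&'()*+,)
    delims |= {46, 47}                     # . /
    delims |= set(range(58, 65))           # 58-64
    delims |= set(range(91, 95))           # 91-94
    delims |= set(range(123, 162))         # 123-161
    delims |= {166, 171, 172, 173, 187, 191, 247}
    if include_hyphen:
        delims.add(45)                     # undecided in the proposal
    if include_low_line:
        delims.add(95)                     # undecided in the proposal
    return frozenset(delims)


def split_words(data, delims):
    """Split a bytes object into words at every delimiter byte."""
    words = []
    current = bytearray()
    for byte in data:
        if byte in delims:
            if current:
                words.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        words.append(bytes(current))
    return words
```

With the default set, split_words(b"Se\xf1or, free-money now!",
build_delimiters()) keeps "free-money" as one word; passing
include_hyphen=True splits it in two.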

Now we have to see whether this is compatible with other charsets. All
ISO-8859-x charsets will work nicely below 128. But ISO-8859-2 already
fails for 161, which is a letter there (LATIN CAPITAL LETTER A WITH OGONEK).

cp1252 has printable characters inside the proposed range, several of
them alphabetic: 128 (Euro sign), 138 (LATIN CAPITAL LETTER S WITH
CARON), 140 (LATIN CAPITAL LIGATURE OE) etc.
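
This kind of compatibility check can be automated. The sketch below
decodes each proposed boundary byte in a given charset and reports those
that turn out to be letters. It assumes Python's built-in codecs
(iso8859_2, cp1252) and the Unicode general category as the definition
of "letter"; letters_among is a hypothetical helper name.

```python
import unicodedata

# Proposed ISO-8859-1 boundary bytes (45 and 95 left out as undecided).
PROPOSED = (
    set(range(0, 45)) | {46, 47} | set(range(58, 65)) | set(range(91, 95))
    | set(range(123, 162)) | {166, 171, 172, 173, 187, 191, 247}
)


def letters_among(delims, charset):
    """Map each delimiter byte that decodes to a letter to its Unicode name."""
    clashes = {}
    for byte in sorted(delims):
        try:
            char = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this charset
        if unicodedata.category(char).startswith("L"):
            clashes[byte] = unicodedata.name(char)
    return clashes


if __name__ == "__main__":
    for charset in ("iso8859_1", "iso8859_2", "cp1252"):
        print(charset, letters_among(PROPOSED, charset))
```

Note that Unicode classes the Euro sign as a currency symbol, so a check
based on letter categories does not flag 128 in cp1252, although it does
flag 138 and 140.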

This shows that my suggestion was too bold. So what to do? We
certainly can agree on the ASCII part. But there are non-standard
quotation marks above 127 (like 147 and 148 in cp1252, which are
non-printable in ISO-8859-x, so this is fine).

So the question is: Can somebody come up with a set of characters
which
a) are never part of words and
b) capture enough to separate words from punctuation and other words?
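
One possible way to look for candidates for (a) is to intersect the
non-word bytes of all charsets under consideration. The sketch below is
only a starting point under stated assumptions: the charset list is an
example, "word character" is taken to mean Unicode letter or number, and
is_word_char and common_non_word_bytes are hypothetical names. Whether
the resulting set also satisfies (b) still has to be judged by hand.

```python
import unicodedata

# Example charsets only; extend with the other ISO-8859-x and cp125x tables.
CHARSETS = ["iso8859_1", "iso8859_2", "iso8859_15",
            "cp1250", "cp1251", "cp1252"]


def is_word_char(byte, charset):
    """True if the byte decodes to a letter or number in the given charset."""
    try:
        char = bytes([byte]).decode(charset)
    except UnicodeDecodeError:
        return False  # unassigned byte: assumed not to be a word character
    return unicodedata.category(char)[0] in ("L", "N")


def common_non_word_bytes(charsets):
    """Bytes that are not a word character in any of the given charsets."""
    return {
        byte for byte in range(256)
        if not any(is_word_char(byte, cs) for cs in charsets)
    }


if __name__ == "__main__":
    print(sorted(common_non_word_bytes(CHARSETS)))
```

Under these assumptions hyphen (45) and low line (95) land in the
boundary set too, so they would have to be removed explicitly if they
are to be kept as part of words.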

pi
