Cyrillic issues in 0.94.12

Yar Tikhiy yar at comp.chem.msu.su
Tue May 24 17:03:50 CEST 2005


On Mon, May 23, 2005 at 06:01:00PM -0400, David Relson wrote:
> On Mon, 23 May 2005 18:27:21 +0400
> 
> Your patch for charset.c has been applied and is now in CVS.  You'll
> also want the attached patch for lexer.c.
> 
> The CP866 patch originated with a Russian speaker and I applied it to
> bogofilter.  Looking at it, it seems to focus on using CP866 rather
> than KOI8-R.

Thanks!
 
> I'm perfectly willing to change '--enable-russian' to '--enable-
> cyrillic' or '--enable-cp866', whatever is most meaningful to the
> people who would use it.  As my knowledge of languages and charsets
> is limited, I'm not the best person to name the option.
> 
> There's also an "--with-charset=..." option for configure that may
> be of use to you.  Are you aware of it?  Possibly the configure line
> below may help:
> 
>    ./configure --enable-russian --with-charset=koi8-r

AFAIK, `--with-charset=...' specifies the charset to assume when an
email's MIME headers don't declare one explicitly, doesn't it?  This
is not exactly what I meant.  Let me explain my view of the problem
in detail.

The whole issue of languages, charsets, and encodings is quite
simple as long as a national alphabet has no more than 128 characters
besides the Roman ones, so it is possible to generalize the current
state of affairs.

There are groups of human written languages that share a single
alphabet.  As soon as people speaking such languages start using
computers, they have to encode their national characters as numbers.
Pure Roman characters are now encoded according to US-ASCII, which
leaves codes 128 through 255 for national characters.  (Multi-byte
encodings aside.)

Unfortunately, a particular alphabet is often encoded in more than
one standard way for technical, political, or historical reasons.
For example, there are no fewer than six Czech encodings in use,
even though the Czech alphabet is Roman-based with only a few
accented characters added.

Another example is Cyrillic.  Strictly speaking, the full set of
Cyrillic characters in use is rather large (yet far fewer than 128
characters) and differs somewhat from one language to another.
E.g., the Russian alphabet is a superset of the Bulgarian alphabet,
while the Ukrainian alphabet has a large intersection with the
former two.  For obvious historical reasons, the first Cyrillic
encodings to appear were suited for Russian.  To the best of my
knowledge, every modern encoding that also covers other Cyrillic
alphabets is based on some encoding originally specific to Russian,
and so remains backwards compatible with it.  Perhaps this is why
it is often assumed that Russian == Cyrillic when it comes to
encodings.
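To illustrate the divergence: the very same Russian letter takes a
different byte value in each of the common single-byte Cyrillic
encodings.  A quick sketch in Python (bogofilter itself is written
in C; this is just for illustration):

```python
# The lowercase Cyrillic letter "а" (U+0430) in three common
# single-byte Cyrillic encodings -- each assigns it a different code.
for name in ("koi8-r", "cp1251", "cp866"):
    code = "а".encode(name)[0]
    print(f"{name}: 0x{code:02X}")
# koi8-r: 0xC1
# cp1251: 0xE0
# cp866: 0xA0
```

So a wordlist that stores raw bytes would count the same word three
times, once per encoding, unless tokens are converted first.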

All my discourse boils down to the following.  It has been more or
less agreed on this list that processing tokens in a language having
more than one encoding will benefit from converting the digital
representation of such tokens to a single pre-configured encoding.
When bogofilter receives an email whose charset is specified in its
MIME headers, bogofilter can check whether that encoding should be
converted to another one for the sake of wordlist compactness and
better spam/ham detection.  This is what the proposed prototype
option `--with-LANGGROUP=CHARSET' is for, e.g., `--with-cyrillic=koi8-r'.
Of course, each such language group has to be supported by proper code.
Additionally, `--with-charset=CHARSET' can tell bogofilter to treat
emails with an unspecified encoding as encoded according to CHARSET.
This is why the two options complement each other.
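The proposed normalization step could be sketched roughly as follows
(a Python sketch; the function name and the KOI8-R default target
are my assumptions for illustration, not bogofilter's actual code):

```python
def normalize_token(raw: bytes, declared: str, target: str = "koi8-r") -> bytes:
    """Re-encode a token from the charset declared in the MIME headers
    into the single pre-configured target encoding, so that the same
    word always maps to the same wordlist entry."""
    return raw.decode(declared).encode(target)

# The same word arriving in CP1251 and in CP866 collapses to one token:
a = normalize_token("спам".encode("cp1251"), "cp1251")
b = normalize_token("спам".encode("cp866"), "cp866")
assert a == b == "спам".encode("koi8-r")
```

The `--with-charset=CHARSET' fallback would then simply supply the
`declared' argument for emails whose headers name no charset.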

> If that's insufficient, feel free to experiment with the unicode.  It's
> experimental, but feedback and patches are always appreciated.  And if
> _that's_ still not enough, feel free to submit additional patches.

Unicode is a great thing but, as was already noted here some time
ago, it will become really usable no sooner than Unix OSes gain
full support for it in screen drivers, user software, etc.  Until
then, using national encodings is more convenient, since a user can
read the tokens in the terminal, debug wordlists, etc.

-- 
Yar


