Cyrillic issues in 0.94.12
David Relson
relson at osagesoftware.com
Tue May 24 23:03:03 CEST 2005
On Tue, 24 May 2005 19:03:50 +0400
Yar Tikhiy wrote:
> On Mon, May 23, 2005 at 06:01:00PM -0400, David Relson wrote:
> > On Mon, 23 May 2005 18:27:21 +0400
> >
> > Your patch for charset.c has been applied and is now in CVS. You'll
> > also want the attached patch for lexer.c.
> >
> > The CP866 patch originated with a Russian speaker and I applied it to
> > bogofilter. Looking at it, it seems to focus on using CP866 rather
> > than KOI8-R.
>
> Thanks!
>
> > I'm perfectly willing to change '--enable-russian' to '--enable-
> > cyrillic' or '--enable-cp866', whatever is most meaningful to the
> > people who would use it. As my knowledge of languages and charsets
> > is limited, I'm not the best person to name the option.
> >
> > There's also an "--with-charset=..." option for configure that may
> > be of use to you. Are you aware of it? Possibly the configure line
> > below may help:
> >
> > ./configure --enable-russian --with-charset=koi8-r
>
> AFAIK, `--with-charset=...' specifies a charset to assume, should
> an email have no charset specified explicitly in its MIME headers,
> doesn't it? This is not exactly the same as what I meant. Let me
> explain my vision of the problem in detail.
>
> The whole issue of languages, charsets, and encodings is quite
> simple as long as a national alphabet has no more than 128
> characters besides the Roman ones, so it's possible to generalize
> the current state of affairs.
>
> There are groups of human written languages that share a single
> alphabet. As soon as people speaking such languages start using
> computers, they have to encode their national characters as numbers.
> Pure Roman characters are now encoded according to US-ASCII, which
> leaves codes 128 through 255 for national characters. (Multi-byte
> encodings aside.)
>
> Unfortunately, a particular alphabet is often encoded in more than
> one standard way for technical, political, or historical reasons.
> For example, there are no fewer than six Czech encodings in use,
> even though the Czech alphabet is Roman-based with some accented
> characters added.
>
> Another example is Cyrillic. Strictly speaking, the full set of
> Cyrillic characters in use is rather large (yet well under 128
> characters) and differs somewhat from one language to another.
> E.g., the Russian alphabet is a superset of the Bulgarian alphabet,
> while the Ukrainian alphabet has a major intersection with the
> former two. For obvious historical reasons, the first Cyrillic
> encodings to appear were suited for Russian. To the best of my
> knowledge, every modern encoding that also takes other Cyrillic
> alphabets into account is based on some encoding specific to
> Russian, so the former is backwards compatible with the latter.
> Perhaps this is why it is often assumed that Russian == Cyrillic
> when it comes to encodings.
>
> All my discourse boils down to the following. It has been more or
> less agreed on this list that processing tokens in a language
> having more than one encoding will benefit from converting the
> digital representation of such tokens to a single pre-configured
> encoding. When bogofilter receives an email whose charset is
> specified in its MIME headers, bogofilter can check whether that
> encoding should be converted to another one for the sake of
> wordlist compactness and better spam/ham detection. This is what
> the proposed prototype option `--with-LANGGROUP=CHARSET' is for,
> e.g., `--with-cyrillic=koi8-r'. Of course, each such language
> group has to be supported by proper code. Additionally,
> `--with-charset=CHARSET' can tell bogofilter to treat emails with
> an unspecified encoding as encoded according to CHARSET. This is
> why the two options complement each other.
>
> > If that's insufficient, feel free to experiment with the unicode. It's
> > experimental, but feedback and patches are always appreciated. And if
> > _that's_ still not enough, feel free to submit additional patches.
>
> Unicode is a great thing, but as was already noted here some time
> ago, Unicode will become really usable no sooner than Unix OSes
> get full support for it in screen drivers, user software, etc.
> Until then, using national encodings is more convenient, since a
> user can read the tokens from the terminal, debug wordlists, etc.
>
> --
> Yar
Hi Yar,
Your explanation clarifies matters greatly. Without it, I didn't
understand your idea, but now it makes sense. Also, writing the
idea as "--with-LANGGROUP=CHARSET" is helpful.
Have you any idea how much you'd need to change? The idea sounds fine
to me.
As a question to those list readers who are multilingual: would
Yar's idea help you?
Regards,
David