Cyrillic issues in 0.94.12
David Relson
relson at osagesoftware.com
Tue May 24 23:03:03 CEST 2005
On Tue, 24 May 2005 19:03:50 +0400
Yar Tikhiy wrote:
> On Mon, May 23, 2005 at 06:01:00PM -0400, David Relson wrote:
> > On Mon, 23 May 2005 18:27:21 +0400
> >
> > Your patch for charset.c has been applied and is now in CVS. You'll
> > also want the attached patch for lexer.c.
> >
> > The CP866 patch originated with a Russian speaker and I applied it to
> > bogofilter. Looking at it, it seems to focus on using CP866 rather
> > than KOI8-R.
>
> Thanks!
>
> > I'm perfectly willing to change '--enable-russian' to '--enable-
> > cyrillic' or '--enable-cp866', whatever is most meaningful to the
> > people who would use it. As my knowledge of languages and charsets
> > is limited, I'm not the best person to name the option.
> >
> > There's also an "--with-charset=..." option for configure that may
> > be of use to you. Are you aware of it? Possibly the configure line
> > below may help:
> >
> > ./configure --enable-russian --with-charset=koi8-r
>
> AFAIK, `--with-charset=...' specifies a charset to assume, should
> an email have no charset specified explicitly in its MIME headers,
> doesn't it? This is not exactly the same as what I meant. Let me
> explain my vision of the problem in detail.
>
> The whole issue of languages, charsets, and encodings is quite
> simple as long as a national alphabet has no more than 128
> characters besides the Roman ones, so it's possible to generalize
> the current state of affairs.
>
> There are groups of human written languages that share a single
> alphabet. As soon as people speaking such languages start using
> computers, they have to encode their national characters as numbers.
> Pure Roman characters are now encoded according to US-ASCII, which
> leaves codes 128 through 255 for national characters. (Multi-byte
> encodings aside.)
>
> Unfortunately, a particular alphabet is often encoded in more than
> one standard way for technical, political, or historical reasons.
> For example, there are no fewer than six Czech encodings in use,
> even though the Czech alphabet is Roman-based with some accented
> characters added.
>
> Another example is Cyrillic. Strictly speaking, the full set of
> Cyrillic characters in use is rather large (yet well under 128
> characters) and differs somewhat from one language to another.
> E.g., the Russian alphabet is a superset of the Bulgarian alphabet,
> while the Ukrainian alphabet has a major intersection with the
> former two. For obvious historical reasons, the first Cyrillic
> encodings to appear were suited for Russian. To the best of my
> knowledge, every modern encoding that also takes other Cyrillic
> alphabets into account is based on some encoding specific to
> Russian, so the former is backwards compatible with the latter.
> Perhaps this is why it is often assumed that Russian == Cyrillic
> when it comes to encodings.
>
> All my discourse boils down to the following. It has been more or
> less agreed on this list that processing tokens in a language
> having more than one encoding will benefit from converting the
> digital representation of such tokens to a single pre-configured
> encoding. When bogofilter receives an email whose charset is
> specified in its MIME headers, bogofilter can check whether that
> encoding should be converted to another one for the sake of
> wordlist compactness and better spam/ham detection. This is what
> the proposed prototype option `--with-LANGGROUP=CHARSET' is for,
> e.g., `--with-cyrillic=koi8-r'. Of course, each such language
> group has to be supported by proper code. Additionally,
> `--with-charset=CHARSET' can tell bogofilter to treat emails with
> an unspecified encoding as encoded according to CHARSET. This is
> why the two options complement each other.
>
> > If that's insufficient, feel free to experiment with the unicode. It's
> > experimental, but feedback and patches are always appreciated. And if
> > _that's_ still not enough, feel free to submit additional patches.
>
> Unicode is a great thing, but as was already noted here some time
> ago, Unicode will become really usable no sooner than Unix OSes
> get full support for it in screen drivers, user software, etc.
> Until then, using national encodings is more convenient, since a
> user can read the tokens from the terminal, debug wordlists, etc.
>
> --
> Yar
Hi Yar,
Your explanation clarifies matters greatly. Without it, I didn't
understand your idea, but now it makes sense. Also, writing the
idea as "--with-LANGGROUP=CHARSET" is helpful.
Have you any idea how much you'd need to change? The idea sounds fine
to me.
As a question to those list readers who are multilingual: would
Yar's idea help you?
Regards,
David