Cyrillic issues in 0.94.12
David Relson
relson at osagesoftware.com
Tue May 24 00:01:00 CEST 2005
On Mon, 23 May 2005 18:27:21 +0400
Yar Tikhiy wrote:
> Hi there,
>
> I'm afraid that the work done this year on support for Cyrillic
> charsets has introduced some regression. The most obvious problem
> is that Cyrillic users can no longer have bogofilter convert words
> to KOI8-R encoding. A quick glance at src/charset.c reveals there
> is a problem in its ``#ifdef CP866'' logic; the function
> map_windows_1251_to_koi8r() is prototyped and used if CP866 IS NOT
> defined, yet its body is included in compilation if CP866 IS defined.
> I included below a quick fix for that. However, I'm uncertain about
> whether the original intention really was to do CP1251->KOI8-R
> conversion in Cyrillic mail by default.
>
> In addition, the current ``--enable-russian'' option is problematic.
> First, Russian isn't the only language out there using a Cyrillic
> alphabet. This is not about political correctness, but about user
> confusion. Users are now entitled to assume that bogofilter knows the
> peculiarities of Russian grammar, etc. Second, the build-time
> option in fact tells bogofilter to use a particular encoding, CP866,
> in processing Cyrillic words. This can hardly be deduced from the
> option's name. Moreover, there are users of previous bogofilter
> versions that already have their wordlists stored in KOI8-R encoding.
> Should they try the new ``--enable-russian'' option, their confusion
> will increase even more.
>
> As long as we are still on the way to Unicode and so have to
> special-case different alphabets, each having several encodings,
> I'd suggest options like ``--with-cyrillic=encoding'' telling
> bogofilter to process national mail in a particular supported
> encoding if possible. The job for the Cyrillic case looks rather
> easy, so I'll volunteer for it if we agree here that it's the right
> thing to do for now.
>
> The quick fix follows.
>
...[snip]...
>
> --
> Yar
Hello Yar,
Your patch for charset.c has been applied and is now in CVS. You'll
also want the attached patch for lexer.c.
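For anyone following along, the kind of guard mismatch described above
can be sketched roughly as below. This is not the code from
src/charset.c: DEMO_CP866, demo_cp1251_to_koi8r() and demo_normalize()
are hypothetical names made up for illustration, and only three letters
are mapped. The point is simply that the prototype, the body, and the
call site all sit under the same preprocessor condition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build switch standing in for the CP866 macro discussed in
 * this thread. Leave it undefined to select the CP1251 -> KOI8-R path. */
/* #define DEMO_CP866 */

#ifndef DEMO_CP866
/* Prototype, body, and caller are all guarded by the same condition, so
 * the function can never be used without also being compiled. */
static void demo_cp1251_to_koi8r(unsigned char *buf, size_t len);

/* Illustration only: converts three letters in place and passes every
 * other byte through unchanged. A real converter needs a full table. */
static void demo_cp1251_to_koi8r(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        switch (buf[i]) {
        case 0xC0: buf[i] = 0xE1; break; /* capital A */
        case 0xE0: buf[i] = 0xC1; break; /* small a   */
        case 0xE1: buf[i] = 0xC2; break; /* small be  */
        default: break;
        }
    }
}
#endif

/* Hypothetical caller: the conversion is applied only when it was built. */
void demo_normalize(unsigned char *buf, size_t len)
{
#ifndef DEMO_CP866
    demo_cp1251_to_koi8r(buf, len);
#else
    (void)buf;
    (void)len;
#endif
}
```

With DEMO_CP866 left undefined, demo_normalize() rewrites the three
mapped CP1251 bytes to their KOI8-R values; with it defined, the buffer
is left alone and the converter is not compiled at all.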
The CP866 patch originated with a Russian speaker and I applied it to
bogofilter. Looking at it, it seems to focus on using CP866 rather
than KOI8-R.
I'm perfectly willing to change '--enable-russian' to
'--enable-cyrillic' or '--enable-cp866', whichever is most meaningful to
the people who would use it. As my knowledge of languages and charsets
is limited, I'm not the best person to name the option.
There's also a "--with-charset=..." option for configure that may
be of use to you. Are you aware of it? The configure line below
may help:
./configure --enable-russian --with-charset=koi8-r
If that's insufficient, feel free to experiment with the Unicode support.
It's experimental, but feedback and patches are always appreciated. And if
_that's_ still not enough, feel free to submit additional patches.
As you're aware, language/charset support in bogofilter is skeletal.
Code contributions from knowledgeable folks are always welcome!
David
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.0523.lexer.c
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20050523/4d418246/attachment.c>