Cyrillic issues in 0.94.12

David Relson relson at osagesoftware.com
Tue May 24 00:01:00 CEST 2005


On Mon, 23 May 2005 18:27:21 +0400
Yar Tikhiy wrote:

> Hi there,
> 
> I'm afraid that the work done this year on support for Cyrillic
> charsets has introduced some regression.  The most obvious problem
> is that Cyrillic users can no longer have bogofilter convert words
> to KOI8-R encoding.  A quick glance at src/charset.c reveals there
> is a problem in its ``#ifdef CP866'' logic; the function
> map_windows_1251_to_koi8r() is prototyped and used if CP866 IS NOT
> defined, yet its body is included in compilation if CP866 IS defined.
> I included below a quick fix for that.  However, I'm uncertain about
> whether the original intention really was to do CP1251->KOI8-R
> conversion in Cyrillic mail by default.
> 
> In addition, the current ``--enable-russian'' option isn't all good.
> First, Russian isn't the only language out there using a Cyrillic
> alphabet.  This is not about political correctness, but about user
> confusion.  Now users have the right to assume that bogofilter knows
> the peculiarities of Russian grammar, etc.  Second, the build-time
> option in fact tells bogofilter to use a particular encoding, CP866,
> in processing Cyrillic words.  This can hardly be deduced from the
> option's name.  Moreover, there are users of previous bogofilter
> versions that already have their wordlists stored in KOI8-R encoding.
> Should they try the new ``--enable-russian'' option, their confusion
> will increase even more.
> 
> As long as we still are on the way to Unicode and so we have to
> special-case different alphabets, each having several encodings,
> I'd suggest options like ``--with-cyrillic=encoding'' telling
> bogofilter to process national mail in a particular supported
> encoding if possible.  The job for the Cyrillic case looks rather
> easy, so I'll volunteer for it if we agree here that it's the right
> thing to do for now.
> 
> The quick fix follows.
> 
...[snip]...
> 
> -- 
> Yar

Hello Yar,

Your patch for charset.c has been applied and is now in CVS.  You'll
also want the attached patch for lexer.c.
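
For anyone reading along in the archive, the guard mismatch described
above comes down to three pieces (prototype, caller, body) that need to
sit under the same preprocessor condition.  The sketch below is only an
illustration of that idea, not the code in src/charset.c: the demo_*
names, the function signature, and the two-entry mapping are all
hypothetical, and the real conversion covers the whole table.

```c
/* Illustrative sketch only; this is NOT the code in src/charset.c.
 * The demo_* names and the two-entry mapping are hypothetical. */
#include <assert.h>

/* #define CP866 */  /* left undefined here, so the KOI8-R path is built */

#ifndef CP866
/* Prototype: compiled only when CP866 is NOT defined. */
static unsigned char demo_cp1251_to_koi8r(unsigned char ch);
#endif

/* Caller: uses the converter under the same condition as the prototype. */
unsigned char demo_normalize(unsigned char ch)
{
#ifndef CP866
    return demo_cp1251_to_koi8r(ch);
#else
    return ch;  /* CP866 handling is not modelled in this sketch */
#endif
}

/* Body: must be guarded by the same condition as the prototype and the
 * caller.  The reported problem amounts to "#ifdef CP866" on this line. */
#ifndef CP866
static unsigned char demo_cp1251_to_koi8r(unsigned char ch)
{
    switch (ch) {
    case 0xC0: return 0xE1;  /* CP1251 Cyrillic capital A -> KOI8-R */
    case 0xE0: return 0xC1;  /* CP1251 Cyrillic small a   -> KOI8-R */
    default:   return ch;    /* everything else passes through here */
    }
}
#endif

/* Tiny self-check of the two sample mappings. */
void demo_self_check(void)
{
    assert(demo_normalize(0xC0) == 0xE1);
    assert(demo_normalize(0xE0) == 0xC1);
    assert(demo_normalize('a') == 'a');
}
```

With all three guards spelled the same way, a build without CP866 gets
the prototype, the call, and the definition together, and a build with
CP866 gets none of them.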

The CP866 patch originated with a Russian speaker and I applied it to
bogofilter.  Looking at it, it seems to focus on using CP866 rather
than KOI8-R.


I'm perfectly willing to change '--enable-russian' to
'--enable-cyrillic' or '--enable-cp866', whichever is most meaningful
to the people who would use it.  As my knowledge of languages and
charsets is limited, I'm not the best person to name the option.

There's also a "--with-charset=..." option for configure that may
be of use to you.  Are you aware of it?  Possibly the configure line
below may help:

   ./configure --enable-russian --with-charset=koi8-r

If that's insufficient, feel free to experiment with the Unicode
support.  It's experimental, but feedback and patches are always
appreciated.  And if _that's_ still not enough, feel free to submit
additional patches.

As you're aware, language/charset support in bogofilter is skeletal.
Code contributions from knowledgeable folks are always welcome!

David
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch.0523.lexer.c
URL: <https://www.bogofilter.org/pipermail/bogofilter-dev/attachments/20050523/4d418246/attachment.c>

