Cyrillic issues in 0.94.12

Yar Tikhiy yar at comp.chem.msu.su
Mon May 23 16:27:21 CEST 2005


Hi there,

I'm afraid that this year's work on support for Cyrillic charsets
has introduced a regression.  The most obvious problem is that
Cyrillic users can no longer have bogofilter convert words to the
KOI8-R encoding.  A quick glance at src/charset.c reveals a problem
in its ``#ifdef CP866'' logic: the function
map_windows_1251_to_koi8r() is prototyped and used if CP866 IS NOT
defined, yet its body is compiled only if CP866 IS defined, so
without CP866 the function is empty and no conversion takes place.
I have included a quick fix for that below.  However, I'm not sure
whether the original intention really was to do CP1251->KOI8-R
conversion in Cyrillic mail by default.
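
To make the failure mode concrete, here is a minimal, self-contained
sketch of the same preprocessor pattern.  It is not the bogofilter
source: demo_table, map_buggy(), map_fixed() and the single table
entry are stand-ins made up for illustration (0xC0 -> 0xE1 is the
windows-1251 -> KOI8-R mapping for the capital letter A).

```c
/* Hypothetical stand-in for the translation table; bogofilter's real
 * table and map_xlate_characters() are not reproduced here. */
static unsigned char demo_table[256];

/* Reset the table to the identity mapping. */
static void demo_reset(void)
{
    int i;
    for (i = 0; i < 256; i++)
        demo_table[i] = (unsigned char)i;
}

/* Buggy layout: the guard sits inside the body, so without CP866 the
 * function still exists and can be called, but does nothing. */
static void map_buggy(void)
{
#ifdef CP866
    demo_table[0xC0] = 0xE1;
#endif
}

/* Fixed layout: the guard wraps the whole function and matches the
 * condition under which it is prototyped and called. */
#ifndef CP866
static void map_fixed(void)
{
    demo_table[0xC0] = 0xE1;
}
#endif

/* Helper: table entry for 0xC0 after applying the buggy mapper. */
static unsigned char after_buggy(void)
{
    demo_reset();
    map_buggy();
    return demo_table[0xC0];
}

#ifndef CP866
/* Helper: table entry for 0xC0 after applying the fixed mapper. */
static unsigned char after_fixed(void)
{
    demo_reset();
    map_fixed();
    return demo_table[0xC0];
}
#endif
```

Built without -DCP866, after_buggy() leaves the entry at 0xC0, while
after_fixed() yields 0xE1.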

In addition, the current ``--enable-russian'' option has problems of
its own.  First, Russian isn't the only language that uses a
Cyrillic alphabet.  This is not about political correctness, but
about user confusion: given the name, users may reasonably assume
that bogofilter knows the peculiarities of Russian grammar, etc.
Second, the build-time option in fact tells bogofilter to use a
particular encoding, CP866, when processing Cyrillic words.  This
can hardly be deduced from the option's name.  Moreover, there are
users of previous bogofilter versions who already have their
wordlists stored in the KOI8-R encoding.  Should they try the new
``--enable-russian'' option, they will be confused even more.

As long as we are still on the way to Unicode, and so have to
special-case different alphabets, each having several encodings,
I'd suggest options like ``--with-cyrillic=encoding'' that tell
bogofilter to process national-language mail in a particular
supported encoding where possible.  The job for the Cyrillic case
looks rather easy, so I'll volunteer for it if we agree here that
it's the right thing to do for now.
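
For illustration only, one way such an option might surface in the
code is sketched below.  Everything here is an assumption rather than
existing bogofilter code: the CYRILLIC_ENCODING macro (which configure
would define from ``--with-cyrillic=encoding''), the cyr_target
table, and the empty placeholder mappers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: configure would define this from --with-cyrillic=ENC.
 * The default chosen here is only for the sketch. */
#ifndef CYRILLIC_ENCODING
#define CYRILLIC_ENCODING "koi8-r"
#endif

typedef void (*map_fn)(void);

/* Placeholder mappers; real ones would fill the translation table. */
static void map_1251_to_koi8r(void) { /* windows-1251 -> KOI8-R */ }
static void map_1251_to_cp866(void) { /* windows-1251 -> CP866  */ }

/* One entry per supported target encoding. */
struct cyr_target {
    const char *name;     /* encoding name as given to configure */
    map_fn from_cp1251;   /* mapper used for windows-1251 input  */
};

static const struct cyr_target cyr_targets[] = {
    { "koi8-r", map_1251_to_koi8r },
    { "cp866",  map_1251_to_cp866 },
};

/* Look up a target by exact name; returns NULL if unsupported. */
static const struct cyr_target *find_cyr_target(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof cyr_targets / sizeof cyr_targets[0]; i++) {
        if (strcmp(cyr_targets[i].name, name) == 0)
            return &cyr_targets[i];
    }
    return NULL;
}
```

A table like this keeps the chosen encoding explicit in one place, so
adding another target means adding one entry and one mapper rather
than another layer of #ifdef.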

The quick fix follows.

--- charset.c.orig	Wed Apr  6 03:31:29 2005
+++ charset.c	Mon May 23 14:03:01 2005
@@ -277,9 +277,9 @@
     /* Not yet implemented */
 }
 
+#ifndef	CP866
 static void map_windows_1251_to_koi8r(void)
 {
-#ifdef	CP866
     /* Map:  windows-1251 -> KOI8-R (Cyrillic) */
     /* Contributed by: Yar Tikhiy (yarq at users.sourceforge.net) */
     static char xlate_1251[] = {
@@ -295,8 +295,8 @@
 	0xD8, 0xFB,  0xD9, 0xFD,  0xDA, 0xFF,  0xDB, 0xF9,  0xDC, 0xF8,  0xDD, 0xFC,  0xDE, 0xE0,  0xDF, 0xF1,
     };
     map_xlate_characters( xlate_1251, COUNTOF(xlate_1251) );
-#endif
 }
+#endif
 
 #ifdef	CP866
 static void map_windows_1251_to_cp866(void)

----- End patch -----

-- 
Yar


