Russian charsets and functions

David Relson relson at osagesoftware.com
Sun Jan 9 01:29:45 CET 2005


On Sat, 08 Jan 2005 18:08:58 +0300
Evgeny Kotsuba wrote:

> Pavel Kankovsky wrote:
> 
> >I do not think it helps much to discriminate spam from ham when different
> >encodings of the same word are recognized as two different tokens. In 
> >fact, it might make things worse because you need to learn N different
> >encodings of a token rather than one (N depends on the country; for 
> >instance, we've got 3 popular coded charsets in Czechia used in email:
> >ISO 8859-2, its mutilated clone by Microsoft called CP 1250, and
> >UTF-8 (*)).
> >
> >(*) Plus the "ASCII transliteration mode" when letters with diacritical
> >marks are replaced by their counterparts without diacritical mark. Anyway
> >it is still much better than a several years ago when we had at least 6
> >mutually incompatible but widely used coded charsets. :P
> >
> D:>dc
> Universal Russian codepage DeCoder v 0.55b
> (c)Evgeny Kotsuba, 1997-2002
> usage: dc.exe [-][mode] fileFrom [fileTo] [-Qval][-q][-debug]
> mode:  CodepageFrom+CodepageTo+[Text mode][-]
> Codepage:  D(os)|K(oi8)|W(in)|M(ac)|I(so)|Q(uoted)|T(ranslit)|U(nicode)
>            V(olapyuk)|H(TML)|F(ido)|?(unknown)|*(Last Chance)
> 
> So russians have a bit  more codings plus so called "de-bill'ing coding" 
> with all words like ???? ????? ???
> ;-)

Regarding charsets: bogofilter's default character set is "us-ascii", with
special handling of some characters (for example, mapping 0x92 to an
apostrophe).

The iconv_open() prototype is:

  iconv_t iconv_open(const char *tocode, const char *fromcode);

For consistency with existing databases, tocode and fromcode should both
be "us-ascii", i.e. whatever charset DEFAULT_CHARSET is set to.  For
proper internationalization, I suspect that "UTF-8" should be the
default (rather than "us-ascii").
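
For illustration, here is a minimal sketch of how such a call might be
wired up.  This is not bogofilter's actual code: the helper name, the
NULL-fromcode convention, and the error handling are assumptions;
DEFAULT_CHARSET stands for the configure-time setting mentioned just
below.

  #include <stddef.h>
  #include <iconv.h>

  #ifndef DEFAULT_CHARSET
  #define DEFAULT_CHARSET "us-ascii"   /* configure-time default */
  #endif

  /* Translate 'inlen' bytes of 'in' (declared charset 'fromcode') into
   * 'out'; returns the number of bytes written, or (size_t)-1 on failure. */
  size_t charset_to_default(const char *fromcode,
                            char *in, size_t inlen,
                            char *out, size_t outlen)
  {
      size_t left = outlen;
      iconv_t cd;

      if (fromcode == NULL)            /* no charset declared in the message: */
          fromcode = DEFAULT_CHARSET;  /* use the default for both directions */

      cd = iconv_open(DEFAULT_CHARSET, fromcode);   /* tocode, fromcode */
      if (cd == (iconv_t)-1)
          return (size_t)-1;           /* charset pair not supported */

      if (iconv(cd, &in, &inlen, &out, &left) == (size_t)-1) {
          iconv_close(cd);
          return (size_t)-1;           /* EILSEQ, EINVAL, or E2BIG */
      }

      iconv_close(cd);
      return outlen - left;            /* bytes actually written to 'out' */
  }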

Since ./configure allows "--default-charset=UTF-8" (or CP866 or KOI8-R),
the value of DEFAULT_CHARSET can be set according to personal preference.

> >On the other hand, the use of certain characters can be a strong indicator
> >of spam (like capitalization) and character mapping might wipe this useful
> >information out. (One interesting option would be to let the lexer process
> >mapped/normalized text to avoid the pollution of its code with the 
> >idiosyncracies of every known script but to use the original unmapped 
> >text (within the boundaries determined by the lexer) to build tokens.)
> >
> >Anyway, if you really want to implement this, then I suggest to
> >
> >1. translate everything to a common coded charset (Unicode/UTF-8),
> >  
> >
> I don't like this idea. I still think that there should be some national 
> packs  that should include some API for dealing with different charsets  
> from national/user point of  view. For example - I know russian and can 
> easily understand what is wrong with russian words decodings etc.

Wouldn't it work to specify a Russian charset (either CP866 or KOI8-R) as
iconv's default fromcode and tocode values?
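
To make that concrete, here is a small standalone sketch (the charset
pair and the sample bytes are just an example, not anything bogofilter
does today) of iconv normalizing a CP1251-encoded word to KOI8-R:

  #include <iconv.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* the word "spam" in Cyrillic, encoded as CP1251 bytes */
      char cp1251[] = { (char)0xF1, (char)0xEF, (char)0xE0, (char)0xEC, 0 };
      char koi8[16];
      char *in = cp1251, *out = koi8;
      size_t inleft = strlen(cp1251), outleft = sizeof(koi8) - 1;

      iconv_t cd = iconv_open("KOI8-R", "CP1251");  /* tocode, fromcode */
      if (cd == (iconv_t)-1) {
          perror("iconv_open");
          return 1;
      }
      if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
          perror("iconv");
      *out = '\0';
      iconv_close(cd);

      printf("converted to %zu KOI8-R bytes\n", strlen(koi8));
      return 0;
  }

In a real filter the iconv_t descriptor would presumably be opened once
per message (or cached), rather than once per word.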

> >2. do any kind mapping/normalization on the translated text with a
> >   single table for the charset.
> >
> >Step 1. can be done with iconv() on any but really archaic system, and 
> >you get support of a wide set of charsets for free.
> >
> >The drawback of iconv() is the complexity of error recovery when an 
> >incorrect byte sequence is encountered but it can be solved with a little 
> >bit of extra work.
> >
> >  
> >
> Is it possible to find what was converting sequence in case of error ? 
> In case of russian and wrong double converting it is possible only by 
> get some stat info on whole text, but what we can do with single words 
> in data base ? Even if we find converting sequence will it possible to 
> make correct operation ? In case of russian it is limited correctness 
> for reverse operations in case of using  cp1251 and  iso8859-5....

Passthrough mode "-p" outputs the original text of the message (without
charset translation), and that won't change.  At present, the tokens used
for wordlist searching and spam scoring use the charset-translated text.
What will change when iconv is used is how that translation occurs.
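
As for the error recovery mentioned above, one possible approach (an
assumption for illustration, not current bogofilter behaviour, and the
function name is made up) is to substitute a placeholder for an invalid
byte and resume the conversion, so a single bad sequence doesn't abort
tokenization of the whole message:

  #include <errno.h>
  #include <iconv.h>

  /* Convert as much of 'in' as possible; on an invalid or truncated byte
   * sequence, emit '?' and skip one input byte so the conversion can
   * resynchronize.  Returns the number of output bytes produced. */
  size_t convert_lossy(iconv_t cd, char *in, size_t inleft,
                       char *out, size_t outleft)
  {
      size_t outsize = outleft;

      while (inleft > 0) {
          if (iconv(cd, &in, &inleft, &out, &outleft) != (size_t)-1)
              break;                   /* all remaining input converted */
          if ((errno == EILSEQ || errno == EINVAL) && outleft > 0) {
              *out++ = '?';            /* placeholder for the bad byte  */
              outleft--;
              in++;                    /* skip it and keep going        */
              inleft--;
          } else {
              break;                   /* E2BIG: output buffer is full  */
          }
      }
      return outsize - outleft;
  }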


...[snip]...

> >Are you talking about UTF-8 requiring 2-byte sequences to encode Cyrillic
> >characters? Yes, this can be a problem for the Russian language. (To a
> >lesser degree, it could be problem for languages using Latin with
> >diacritical marks as well. I'd expect 1/4-1/3 bloat for texts in those
> >languages.)
> >  
> >
> No, I am talking about pure 8-bit codings

As mentioned, one of the Cyrillic character sets can be selected as the
default charset :-)

> >One approach would be use a more efficient ad-hoc encoding of Unicode
> >code point sequences rather than UTF-8. For instance, you can take your 
> >db, assign codes 1-254 to 254 most frequent characters or even their
> >subsequences (e.g. syllables) and encode the rest as 255 + UTF-8.
> >A more generic compression algorithm, like Huffman coding (probably
> >static with a global list of codes), could work as well.
> >  
> >
> ...and then no one will be able to understand  how it all  work

AFAIK, iconv is documented, understood, and pretty much the standard way
of doing this, is it not?

Regards,

David


