Russian charsets and functions
Evgeny Kotsuba
evgen at shatura.laser.ru
Sat Jan 8 16:08:58 CET 2005
Pavel Kankovsky wrote:
> I do not think it helps much to discriminate spam from ham when different
>
>encodings of the same word are recognized as two different tokens. In
>fact, it might make things worse because you need to learn N different
>encodings of a token rather than one (N depends on the country; for
>instance, we've got 3 popular coded charsets in Czechia used in email:
>ISO 8859-2, its mutilated clone by Microsoft called CP 1250, and
>UTF-8 (*)).
>
>(*) Plus the "ASCII transliteration mode" when letters with diacritical
>marks are replaced by their counterparts without diacritical mark. Anyway
>it is still much better than several years ago when we had at least 6
>mutually incompatible but widely used coded charsets. :P
>
D:>dc
Universal Russian codepage DeCoder v 0.55b
(c)Evgeny Kotsuba, 1997-2002
usage: dc.exe [-][mode] fileFrom [fileTo] [-Qval][-q][-debug]
mode: CodepageFrom+CodepageTo+[Text mode][-]
Codepage: D(os)|K(oi8)|W(in)|M(ac)|I(so)|Q(uoted)|T(ranslit)|U(nicode)
V(olapyuk)|H(TML)|F(ido)|?(unknown)|*(Last Chance)
So Russians have a few more encodings, plus the so-called "de-bill'ing
encoding" that turns every word into something like ???? ????? ???
;-)
>
>On the other hand, the use of certain characters can be a strong indicator
>of spam (like capitalization) and character mapping might wipe this useful
>information out. (One interesting option would be to let the lexer process
>mapped/normalized text to avoid the pollution of its code with the
>idiosyncrasies of every known script but to use the original unmapped
>text (within the boundaries determined by the lexer) to build tokens.)
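The option in parentheses can be sketched roughly: find token boundaries on a normalized copy of the text, but emit the original, unmapped characters for those spans. This is a hypothetical scheme (not bogofilter's actual lexer), and it assumes the normalization maps one character to one character so that offsets line up:

```python
import re

def tokenize_preserving_original(text: str):
    """Find token boundaries on a normalized copy of the text,
    but build the tokens from the original, unmapped text.
    Assumes normalization is 1:1 per character (not true for
    every Unicode case mapping, e.g. German sharp s)."""
    norm = ''.join(c.lower() for c in text)  # toy normalization
    return [text[m.start():m.end()] for m in re.finditer(r'\w+', norm)]

print(tokenize_preserving_original("BUY Viagra NOW"))
# -> ['BUY', 'Viagra', 'NOW'] -- capitalization survives in the tokens
```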
>
>Anyway, if you really want to implement this, then I suggest to
>
>1. translate everything to a common coded charset (Unicode/UTF-8),
>
>
I don't like this idea. I still think there should be national packs
that include an API for dealing with the different charsets from a
national/user point of view. For example, I know Russian and can easily
tell what has gone wrong when Russian words are decoded incorrectly.
>2. do any kind of mapping/normalization on the translated text with a
> single table for the charset.
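The two suggested steps can be sketched in Python, with its codec machinery standing in for iconv() (the function names and charset choices here are illustrative, not anything bogofilter actually does):

```python
def to_common_form(raw: bytes, charset: str) -> str:
    """Step 1: translate everything to a common coded charset
    (here a Unicode str, which would be serialized as UTF-8)."""
    return raw.decode(charset, errors='replace')

def normalize(text: str) -> str:
    """Step 2: apply a single mapping/normalization table to the
    already-translated text, e.g. case folding."""
    return text.casefold()

# The same Russian word in two different charsets maps to one token:
word_cp1251 = 'Спам'.encode('cp1251')
word_koi8   = 'Спам'.encode('koi8-r')
assert normalize(to_common_form(word_cp1251, 'cp1251')) == \
       normalize(to_common_form(word_koi8, 'koi8-r'))
```

With this ordering, only one normalization table is needed, no matter how many input charsets exist.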
>
>Step 1. can be done with iconv() on any but really archaic system, and
>you get support of a wide set of charsets for free.
>
>The drawback of iconv() is the complexity of error recovery when an
>incorrect byte sequence is encountered but it can be solved with a little
>bit of extra work.
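One common form of that "little bit of extra work" is to substitute a replacement character for the bad sequence and resynchronize at the next valid byte. Python's codecs expose this behavior directly (a sketch of the idea, not of iconv() itself):

```python
def decode_with_recovery(raw: bytes, charset: str) -> str:
    """Decode, replacing any invalid byte sequence with U+FFFD
    instead of aborting, and continue with the rest of the input."""
    return raw.decode(charset, errors='replace')

# 0xD0 starts a 2-byte UTF-8 sequence, but 0x20 cannot continue it:
broken = b'ok \xd0 end'
print(decode_with_recovery(broken, 'utf-8'))  # prints 'ok \ufffd end'
```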
>
>
>
Is it possible to find out what the conversion sequence was when an
error occurs? For Russian text mangled by a wrong double conversion, the
sequence can be identified only by gathering statistics over the whole
text, but what can we do with single words in the database? And even if
we identify the conversion sequence, can it be reversed correctly? For
Russian, the reverse operation is only partially correct when cp1251 and
iso8859-5 are involved....
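As a toy illustration of the cp1251/iso8859-5 case: for the Cyrillic letters themselves the mis-decoding happens to round-trip, because both charsets assign a character to every byte in that range, but that is exactly the "limited correctness" above, since it does not hold for arbitrary bytes or for other charset pairs:

```python
original = 'привет'                       # "hello" in Russian
# cp1251 bytes wrongly decoded as iso8859-5 produce mojibake:
garbled = original.encode('cp1251').decode('iso8859-5')
assert garbled != original
# Re-encoding with the wrong charset and decoding with the right one
# restores the text -- but only while every byte stays inside the
# range that both charsets define.
restored = garbled.encode('iso8859-5').decode('cp1251')
assert restored == original
```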
>>one problem is that the charset may be set improperly - by the mail
>>client and/or the spammer,
>>
>>
>
>...and the result will be unreadable mess (unless the client is "smart"
>(in the redmondian way) and tries hard to guess the correct charset).
>
>Anyway, if this occurs more frequently then Bogofilter should be able to
>learn to recognize incorrectly encoded tokens as well.
>
>
>>the second problem will be doubling the database. Really,
>>English/Americans don't need Russian or Asian spam or mail, Russians
>>don't need Asian spam/mail, and [...]
>>
>>
>
>Speak for yourself. :)
>
>
Of course.
I have found Asian ham in my own database, and it really does produce a
number of wrong, binary-looking tokens.
>>[...] all English letters are placed at 0-127 and Russian at 128-255.
>>
>>
>
>Are you talking about UTF-8 requiring 2-byte sequences to encode Cyrillic
>characters? Yes, this can be a problem for the Russian language. (To a
>lesser degree, it could be a problem for languages using Latin with
>diacritical marks as well. I'd expect 1/4-1/3 bloat for texts in those
>languages.)
>
>
No, I am talking about pure 8-bit encodings.
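The size difference between those pure 8-bit encodings and UTF-8 is easy to show:

```python
word = 'привет'  # six Cyrillic letters
assert len(word.encode('koi8-r')) == 6   # one byte per letter
assert len(word.encode('cp1251')) == 6   # one byte per letter
assert len(word.encode('utf-8')) == 12   # two bytes per letter
```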
>One approach would be use a more efficient ad-hoc encoding of Unicode
>code point sequences rather than UTF-8. For instance, you can take your
>db, assign codes 1-254 to 254 most frequent characters or even their
>subsequences (e.g. syllables) and encode the rest as 255 + UTF-8.
>A more generic compression algorithm, like Huffman coding (probably
>static with a global list of codes), could work as well.
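The ad-hoc encoding described above might be sketched like this (purely hypothetical code: the frequency table, the choice of 255 as escape byte, and the helper names are all illustrative assumptions, not an existing format):

```python
from collections import Counter

ESCAPE = 255  # marker byte: "a raw UTF-8 sequence follows"

def build_table(corpus: str) -> list:
    """Assign codes 1..254 to the 254 most frequent characters."""
    return [c for c, _ in Counter(corpus).most_common(254)]

def encode(text: str, table: list) -> bytes:
    code = {c: i + 1 for i, c in enumerate(table)}
    out = bytearray()
    for ch in text:
        if ch in code:
            out.append(code[ch])           # frequent char: one byte
        else:
            out.append(ESCAPE)             # rare char: 255 + UTF-8
            out.extend(ch.encode('utf-8'))
    return bytes(out)

def decode(data: bytes, table: list) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == ESCAPE:
            lead = data[i + 1]             # UTF-8 length from lead byte
            n = (1 if lead < 0x80 else 2 if lead < 0xE0 else
                 3 if lead < 0xF0 else 4)
            out.append(data[i + 1:i + 1 + n].decode('utf-8'))
            i += 1 + n
        else:
            out.append(table[data[i] - 1])
            i += 1
    return ''.join(out)
```

Frequent characters then cost one byte each, and anything outside the table still round-trips through the UTF-8 escape.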
>
>
...and then no one will be able to understand how it all works.
SY,
EK