Russian charsets and functions
Evgeny Kotsuba
evgen at shatura.laser.ru
Sat Jan 8 16:08:58 CET 2005
Pavel Kankovsky wrote:
> I do not think it helps much to discriminate spam from ham when different
>
>encodings of the same word are recognized as two different tokens. In
>fact, it might make things worse because you need to learn N different
>encodings of a token rather than one (N depends on the country; for
>instance, we've got 3 popular coded charsets in Czechia used in email:
>ISO 8859-2, its mutilated clone by Microsoft called CP 1250, and
>UTF-8 (*)).
>
>(*) Plus the "ASCII transliteration mode" when letters with diacritical
>marks are replaced by their counterparts without diacritical mark. Anyway
>it is still much better than several years ago when we had at least 6
>mutually incompatible but widely used coded charsets. :P
>
D:>dc
Universal Russian codepage DeCoder v 0.55b
(c)Evgeny Kotsuba, 1997-2002
usage: dc.exe [-][mode] fileFrom [fileTo] [-Qval][-q][-debug]
mode: CodepageFrom+CodepageTo+[Text mode][-]
Codepage: D(os)|K(oi8)|W(in)|M(ac)|I(so)|Q(uoted)|T(ranslit)|U(nicode)
V(olapyuk)|H(TML)|F(ido)|?(unknown)|*(Last Chance)
So Russians have a few more encodings, plus the so-called "de-bill'ing
encoding" that turns every word into something like ???? ????? ???
;-)
>
>On the other hand, the use of certain characters can be a strong indicator
>of spam (like capitalization) and character mapping might wipe this useful
>information out. (One interesting option would be to let the lexer process
>mapped/normalized text to avoid the pollution of its code with the
>idiosyncrasies of every known script but to use the original unmapped
>text (within the boundaries determined by the lexer) to build tokens.)
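The option in parentheses can be sketched roughly: find token boundaries on a normalized copy of the text, but emit the original, unmapped characters for those spans. This is a hypothetical scheme (not bogofilter's actual lexer), and it assumes the normalization maps one character to one character so that offsets line up:

```python
import re

def tokenize_preserving_original(text: str):
    """Find token boundaries on a normalized copy of the text,
    but build the tokens from the original, unmapped text.
    Assumes normalization is 1:1 per character (not true for
    every Unicode case mapping, e.g. German sharp s)."""
    norm = ''.join(c.lower() for c in text)  # toy normalization
    return [text[m.start():m.end()] for m in re.finditer(r'\w+', norm)]

print(tokenize_preserving_original("BUY Viagra NOW"))
# -> ['BUY', 'Viagra', 'NOW'] -- capitalization survives in the tokens
```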
>
>Anyway, if you really want to implement this, then I suggest to
>
>1. translate everything to a common coded charset (Unicode/UTF-8),
>
>
I don't like this idea. I still think there should be national packs
that include an API for dealing with the different charsets from a
national/user point of view. For example, I know Russian and can easily
tell what has gone wrong when Russian words are decoded incorrectly.
>2. do any kind of mapping/normalization on the translated text with a
> single table for the charset.
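The two suggested steps can be sketched in Python, with its codec machinery standing in for iconv() (the function names and charset choices here are illustrative, not anything bogofilter actually does):

```python
def to_common_form(raw: bytes, charset: str) -> str:
    """Step 1: translate everything to a common coded charset
    (here a Unicode str, which would be serialized as UTF-8)."""
    return raw.decode(charset, errors='replace')

def normalize(text: str) -> str:
    """Step 2: apply a single mapping/normalization table to the
    already-translated text, e.g. case folding."""
    return text.casefold()

# The same Russian word in two different charsets maps to one token:
word_cp1251 = 'Спам'.encode('cp1251')
word_koi8   = 'Спам'.encode('koi8-r')
assert normalize(to_common_form(word_cp1251, 'cp1251')) == \
       normalize(to_common_form(word_koi8, 'koi8-r'))
```

With this ordering, only one normalization table is needed, no matter how many input charsets exist.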
>
>Step 1. can be done with iconv() on any but really archaic system, and
>you get support of a wide set of charsets for free.
>
>The drawback of iconv() is the complexity of error recovery when an
>incorrect byte sequence is encountered but it can be solved with a little
>bit of extra work.
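One common form of that "little bit of extra work" is to substitute a replacement character for the bad sequence and resynchronize at the next valid byte. Python's codecs expose this behavior directly (a sketch of the idea, not of iconv() itself):

```python
def decode_with_recovery(raw: bytes, charset: str) -> str:
    """Decode, replacing any invalid byte sequence with U+FFFD
    instead of aborting, and continue with the rest of the input."""
    return raw.decode(charset, errors='replace')

# 0xD0 starts a 2-byte UTF-8 sequence, but 0x20 cannot continue it:
broken = b'ok \xd0 end'
print(decode_with_recovery(broken, 'utf-8'))  # prints 'ok \ufffd end'
```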
>
>
>
Is it possible to find out what the conversion sequence was when an
error occurs? For Russian text mangled by a wrong double conversion, the
sequence can be identified only by gathering statistics over the whole
text, but what can we do with single words in the database? And even if
we identify the conversion sequence, can it be reversed correctly? For
Russian, the reverse operation is only partially correct when cp1251 and
iso8859-5 are involved....
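As a toy illustration of the cp1251/iso8859-5 case: for the Cyrillic letters themselves the mis-decoding happens to round-trip, because both charsets assign a character to every byte in that range, but that is exactly the "limited correctness" above, since it does not hold for arbitrary bytes or for other charset pairs:

```python
original = 'привет'                       # "hello" in Russian
# cp1251 bytes wrongly decoded as iso8859-5 produce mojibake:
garbled = original.encode('cp1251').decode('iso8859-5')
assert garbled != original
# Re-encoding with the wrong charset and decoding with the right one
# restores the text -- but only while every byte stays inside the
# range that both charsets define.
restored = garbled.encode('iso8859-5').decode('cp1251')
assert restored == original
```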
>>one problem is that the charset may be set improperly - by the mail
>>client and/or the spammer,
>>
>>
>
>...and the result will be unreadable mess (unless the client is "smart"
>(in the redmondian way) and tries hard to guess the correct charset).
>
>Anyway, if this occurs more frequently then Bogofilter should be able to
>learn to recognize incorrectly encoded tokens as well.
>
>
>>the second problem will be doubling the database. Really,
>>English/Americans don't need Russian or Asian spam or mail, Russians
>>don't need Asian spam/mail, and [...]
>>
>>
>
>Speak for yourself. :)
>
>
Of course.
I have found Asian ham in my own database, and it really does produce a
number of wrong, binary-looking tokens.
>>[...] all English letters are placed at 0-127 and Russian at 128-255.
>>
>>
>
>Are you talking about UTF-8 requiring 2-byte sequences to encode Cyrillic
>characters? Yes, this can be a problem for the Russian language. (To a
>lesser degree, it could be a problem for languages using Latin with
>diacritical marks as well. I'd expect 1/4-1/3 bloat for texts in those
>languages.)
>
>
No, I am talking about pure 8-bit encodings.
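The size difference between those pure 8-bit encodings and UTF-8 is easy to show:

```python
word = 'привет'  # six Cyrillic letters
assert len(word.encode('koi8-r')) == 6   # one byte per letter
assert len(word.encode('cp1251')) == 6   # one byte per letter
assert len(word.encode('utf-8')) == 12   # two bytes per letter
```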
>One approach would be use a more efficient ad-hoc encoding of Unicode
>code point sequences rather than UTF-8. For instance, you can take your
>db, assign codes 1-254 to 254 most frequent characters or even their
>subsequences (e.g. syllables) and encode the rest as 255 + UTF-8.
>A more generic compression algorithm, like Huffman coding (probably
>static with a global list of codes), could work as well.
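The ad-hoc encoding described above might be sketched like this (purely hypothetical code: the frequency table, the choice of 255 as escape byte, and the helper names are all illustrative assumptions, not an existing format):

```python
from collections import Counter

ESCAPE = 255  # marker byte: "a raw UTF-8 sequence follows"

def build_table(corpus: str) -> list:
    """Assign codes 1..254 to the 254 most frequent characters."""
    return [c for c, _ in Counter(corpus).most_common(254)]

def encode(text: str, table: list) -> bytes:
    code = {c: i + 1 for i, c in enumerate(table)}
    out = bytearray()
    for ch in text:
        if ch in code:
            out.append(code[ch])           # frequent char: one byte
        else:
            out.append(ESCAPE)             # rare char: 255 + UTF-8
            out.extend(ch.encode('utf-8'))
    return bytes(out)

def decode(data: bytes, table: list) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == ESCAPE:
            lead = data[i + 1]             # UTF-8 length from lead byte
            n = (1 if lead < 0x80 else 2 if lead < 0xE0 else
                 3 if lead < 0xF0 else 4)
            out.append(data[i + 1:i + 1 + n].decode('utf-8'))
            i += 1 + n
        else:
            out.append(table[data[i] - 1])
            i += 1
    return ''.join(out)
```

Frequent characters then cost one byte each, and anything outside the table still round-trips through the UTF-8 escape.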
>
>
...and then no one will be able to understand how it all works.
SY,
EK