Russian charsets and functions

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Wed Jan 5 16:11:32 CET 2005


On Mon, 3 Jan 2005, David Relson wrote:

> I'm willing for bogofilter to include all the language tables.  However
> there are multiple, conflicting table entries and mapping functions.  If
> someone who knows more than I do would care to provide direction, it'd
> be helpful.

This mapping and charset translation business is similar to the question
of whether capitalization should be preserved or not (i.e. "word" vs.
"Word"). It can hurt in one case and help in another.

I do not think it helps much to discriminate spam from ham when different 
encodings of the same word are recognized as two different tokens. In 
fact, it might make things worse because you need to learn N different
encodings of a token rather than one (N depends on the country; for 
instance, we've got 3 popular coded charsets used in email in Czechia:
ISO 8859-2, its mutilated clone by Microsoft called CP 1250, and
UTF-8 (*)).

(*) Plus the "ASCII transliteration mode" in which letters with diacritical
marks are replaced by their counterparts without diacritical marks. Anyway,
it is still much better than several years ago, when we had at least 6
mutually incompatible but widely used coded charsets. :P

On the other hand, the use of certain characters can be a strong indicator
of spam (like capitalization) and character mapping might wipe this useful
information out. (One interesting option would be to let the lexer process
mapped/normalized text, to avoid polluting its code with the
idiosyncrasies of every known script, but to use the original unmapped
text (within the boundaries determined by the lexer) to build tokens.)
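
For what it's worth, a toy sketch of that option (everything here is
hypothetical, not bogofilter code; it assumes a single-byte charset and
a 1:1 normalization table, so that offsets in the normalized copy line
up with the original buffer):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical 1:1 normalization table, e.g. lowercasing plus folding
 * of accented letters for an 8-bit charset; plain tolower() here just
 * to keep the sketch short. */
static unsigned char norm_tbl[256];

static void init_norm(void)
{
    int i;

    for (i = 0; i < 256; i++)
        norm_tbl[i] = (unsigned char)tolower(i);
    /* a real table would also fold diacritics, map NBSP to space, ... */
}

/* Token boundaries are found on the normalized copy, but the token text
 * itself is taken from the original buffer, so "loud" characters (odd
 * capitalization, exotic letters) survive inside the token. */
static void tokenize(const unsigned char *orig, size_t len)
{
    unsigned char *nrm = malloc(len);
    size_t i, start;

    if (nrm == NULL)
        return;
    for (i = 0; i < len; i++)
        nrm[i] = norm_tbl[orig[i]];

    i = 0;
    while (i < len) {
        while (i < len && !isalpha(nrm[i]))
            i++;                        /* skip separators (normalized) */
        start = i;
        while (i < len && isalpha(nrm[i]))
            i++;                        /* token span (normalized) */
        if (i > start)
            printf("token: %.*s\n", (int)(i - start),
                   (const char *)orig + start);
    }
    free(nrm);
}

int main(void)
{
    const char msg[] = "Nejlevnejsi VIAGRA zdarma!";

    init_norm();
    tokenize((const unsigned char *)msg, sizeof msg - 1);
    return 0;
}

With multi-byte charsets, or a normalization that changes lengths, you
would have to keep an offset map between the two buffers instead, which
is where it stops being a one-screen hack.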

Anyway, if you really want to implement this, then I suggest that you

1. translate everything to a common coded charset (Unicode/UTF-8),

2. do any kind of mapping/normalization on the translated text with a
   single table for that charset.

Step 1 can be done with iconv() on all but really archaic systems, and
you get support for a wide set of charsets for free.

The drawback of iconv() is the complexity of error recovery when an
incorrect byte sequence is encountered, but that can be solved with a
little bit of extra work.
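
Something along these lines (an untested sketch, not bogofilter code;
the function name, the 4x buffer sizing and the '?' placeholder are my
arbitrary choices):

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Convert `in' (inlen bytes in the charset named `from') to UTF-8.
 * When iconv() chokes on an invalid byte sequence, the offending byte
 * is skipped and a '?' is emitted, so a bogus or misdeclared charset
 * does not abort the whole message.  Returns a malloc'd NUL-terminated
 * string, or NULL on failure. */
static char *to_utf8(const char *from, const char *in, size_t inlen)
{
    iconv_t cd = iconv_open("UTF-8", from);
    char *out, *inp, *outp;
    size_t outsize, inleft, outleft;

    if (cd == (iconv_t)-1)
        return NULL;                    /* unknown charset name */

    outsize = 4 * inlen + 1;            /* generous worst case */
    out = malloc(outsize);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    inp = (char *)in;
    outp = out;
    inleft = inlen;
    outleft = outsize - 1;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;                      /* all input converted */
        if (errno == EILSEQ || errno == EINVAL) {
            inp++;                      /* skip the bad byte ... */
            inleft--;
            if (outleft > 0) {
                *outp++ = '?';          /* ... and leave a marker */
                outleft--;
            }
        } else {
            break;                      /* E2BIG or something worse */
        }
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}

int main(void)
{
    const char latin2[] = "p\xF8\xEDklad";      /* "priklad" in ISO 8859-2 */
    char *u = to_utf8("ISO-8859-2", latin2, strlen(latin2));

    if (u != NULL) {
        printf("%s\n", u);
        free(u);
    }
    return 0;
}

A real implementation would also want to reset the conversion state
after an error (it matters for stateful encodings like ISO-2022) and
grow the output buffer on E2BIG instead of giving up.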


On Tue, 4 Jan 2005, Evgeny Kotsuba wrote:

> one problem is that charset may be set improperly - by mail client 
> and/or spammer,

...and the result will be an unreadable mess (unless the client is "smart"
(in the Redmondian way) and tries hard to guess the correct charset).

Anyway, if this occurs frequently enough, then Bogofilter should be able
to learn to recognize incorrectly encoded tokens as well.

> second problem will be doubling the database. Really, English/Americans
> don't need Russian or Asian spam or mail, Russians don't need Asian
> spam/mail and [...]

Speak for yourself. :)

> [...] all English letters are placed at 0-127 and Russian at 128-255.

Are you talking about UTF-8 requiring 2-byte sequences to encode Cyrillic
characters? Yes, this can be a problem for the Russian language. (To a
lesser degree, it could be a problem for languages using Latin with
diacritical marks as well. I'd expect 1/4-1/3 bloat for texts in those
languages.)

One approach would be to use a more efficient ad-hoc encoding of Unicode
code point sequences rather than UTF-8. For instance, you could take your
db, assign the codes 1-254 to the 254 most frequent characters or even to
their subsequences (e.g. syllables), and encode the rest as 255 + UTF-8.
A more generic compression algorithm, like Huffman coding (probably
static, with a global list of codes), could work as well.
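
To make that concrete, a rough sketch of the 255-escape variant (the
table contents and names are just for illustration, and it only matches
single characters; a real table would be built from the db statistics
and could hold syllables with longest-match lookup):

#include <stdio.h>
#include <string.h>

#define NCODES 254

/* Placeholder frequency table: the 254 most frequent UTF-8 characters
 * (in a real setup, taken from the db statistics) get the one-byte
 * codes 1..254; everything else is written as 255 + raw UTF-8. */
static const char *freq[NCODES] = { "о", "е", "а", "и", "н", "т" /* ... */ };

/* length of a UTF-8 sequence, judging from its first byte */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;             /* ASCII */
    if (b < 0xE0) return 2;             /* e.g. Cyrillic */
    if (b < 0xF0) return 3;
    return 4;
}

/* Encode valid UTF-8 `in' into `out' (at worst 2 * strlen(in) bytes);
 * returns the number of bytes written. */
static size_t encode(const char *in, char *out)
{
    size_t o = 0;

    while (*in != '\0') {
        size_t n = utf8_seq_len((unsigned char)*in);
        int code = 0, i;

        for (i = 0; i < NCODES; i++) {  /* linear scan, fine for a sketch */
            if (freq[i] != NULL && strlen(freq[i]) == n
                && memcmp(in, freq[i], n) == 0) {
                code = i + 1;
                break;
            }
        }
        if (code != 0) {
            out[o++] = (char)code;      /* frequent character: one byte */
        } else {
            out[o++] = (char)0xFF;      /* escape ... */
            memcpy(out + o, in, n);     /* ... followed by raw UTF-8 */
            o += n;
        }
        in += n;
    }
    return o;
}

int main(void)
{
    char buf[64];
    const char *word = "ответ";

    printf("%zu bytes instead of %zu\n", encode(word, buf), strlen(word));
    return 0;
}

Decoding is symmetric: byte 255 means "copy one UTF-8 sequence
verbatim", any other byte is a table lookup, and since 0xFF never
occurs in valid UTF-8 the escape is unambiguous. The whole Cyrillic
alphabet fits into 254 codes easily, so most Russian text goes back to
one byte per letter.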


--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."



