... convert_unicode.c ...

Pavel Kankovsky peak at argo.troja.mff.cuni.cz
Sat Jun 25 15:39:25 CEST 2005


On Fri, 24 Jun 2005, Matthias Andree wrote:

> As far as I can tell from the Wikipedia information about Chinese and
> Taiwanese scripts, it is uncommon to separate words with blanks in these
> languages, separation is rather used for splitting up paragraphs or
> ideas.

AFAIK, the basic unit of hanzi, the Chinese script, is a monosyllabic
word corresponding to a single hanzi character. More complex concepts are
expressed as multisyllabic phrases (e.g. schizophrenia is expressed as
something like "split personality disease"), and these phrases are often
considered to be long words.

The Taiwanese script is in fact *the* Chinese script because people of
Taiwan decided to stick to the traditional hanzi. It is known as
"traditional Chinese" and BIG5 is the most common encoding for this 
script.

The script used in the People's Republic of China is an (allegedly)
simplified form of hanzi invented in order to (allegedly) make it easier
to spread literacy. It is known as "simplified Chinese" and GB2312 appears
to be its most common encoding.

To make the mess even messier, the Japanese script contains kanji, which
are based on Chinese hanzi but have lost their phonetic nature (one
character = one syllable). So an N-syllable Japanese word can be written
as M kanji characters for virtually any combination of N and M. Moreover,
Japanese text intermixes kanji with katakana and hiragana (its phonetic
syllabaries) rather freely. ISO-2022-JP appears to be the most common
encoding used in Japan.
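
To illustrate why converting everything to Unicode first is attractive,
here is a minimal Python sketch (not bogofilter code). It assumes the
charset label is already known and maps to one of Python's built-in
codecs; to_unicode() and the sample strings are made up for the example.

```python
def to_unicode(raw, charset):
    """Decode raw bytes using the named charset (strict: bad input raises)."""
    return raw.decode(charset)


# The same two characters have different byte sequences in BIG5 and GB2312,
# so byte-level tokens from the two encodings would never match each other.
big5_bytes = "中文".encode("big5")
gb_bytes = "中文".encode("gb2312")

# ISO-2022-JP is a 7-bit encoding that switches character sets with
# escape sequences, so its bytes all stay below 0x80.
jp_bytes = "日本語".encode("iso2022_jp")

same_text = to_unicode(big5_bytes, "big5") == to_unicode(gb_bytes, "gb2312")
```

After decoding, both byte strings yield the same Unicode text, so a
tokenizer working on Unicode sees identical tokens regardless of the
encoding the message arrived in.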

You are right that the use of blanks to separate words is rather uncommon.
They use punctuation imported from Western scripts (dots, commas,
parentheses, etc.) to separate sentences or parts of sentences.

One possible approach might be to treat every single hanzi/kanji character
as a token.
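
A rough Python sketch of that idea, assuming the text has already been
converted to Unicode. The function names are hypothetical, and the ranges
cover only the main CJK Unified Ideographs block and Extension A; a real
tokenizer would probably need more ranges and its own rules for kana.

```python
def is_cjk_ideograph(ch):
    """True for the main CJK Unified Ideographs block and Extension A."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def tokenize(text):
    """Yield each ideograph as its own token; group other alphanumerics."""
    word = []
    for ch in text:
        if is_cjk_ideograph(ch):
            # Flush any pending non-ideograph word, then emit the character.
            if word:
                yield "".join(word)
                word = []
            yield ch
        elif ch.isalnum():
            # Latin letters, digits, kana etc. are collected into runs.
            word.append(ch)
        else:
            # Blanks and punctuation (Western or fullwidth) end a word.
            if word:
                yield "".join(word)
                word = []
    if word:
        yield "".join(word)
```

For example, list(tokenize("buy 便宜 now")) gives
['buy', '便', '宜', 'now']. Note that in this sketch runs of kana end up
grouped into one token, which may or may not be what is wanted.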


I know very little about the Korean script but it appears to be based on 
similar principles.


> The additional difficulty is that depending on region, some scripts are
> right-to-left (names in Taiwan), and some are also column-wise
> (top-to-bottom within the column, with columns from right to left)
> although left-to-right is - according to Wikipedia - gaining ground and
> used particularly when Romanizations or Latin words are used.

I don't think this is a problem because the characters of hanzi/kanji are
AFAIK always encoded in their logical order.


One final observation: it appears to be a quite common feature of Asian
HTML spam to provide the right charset only in the HTML <meta> tag rather
than in the MIME headers, i.e. Content-Type (in fact, I can find a few
samples where there is an explicit (!) bogus charset in Content-Type,
e.g. US-ASCII).
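
One way to cope with this, sketched in Python under the assumption that
the <meta> declaration is more trustworthy than the MIME header for such
messages. That preference is a heuristic, not something I have verified
on a large corpus. decode_html() and META_CHARSET are hypothetical names,
and the regular expression is deliberately simplistic.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def decode_html(body, mime_charset=None):
    """Return (text, charset_used), trying <meta> first, then the MIME charset."""
    candidates = []
    match = META_CHARSET.search(body)
    if match:
        candidates.append(match.group(1).decode("ascii"))
    if mime_charset:
        candidates.append(mime_charset)
    for name in candidates:
        try:
            return body.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name, or bytes invalid in that charset.
            continue
    # Last resort: latin-1 maps every byte, so this never fails.
    return body.decode("latin-1"), "latin-1"
```

With a BIG5 body that declares charset=big5 in <meta> but arrives with
Content-Type charset US-ASCII, this decodes the body as BIG5 instead of
failing on the first non-ASCII byte.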


--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."




More information about the bogofilter-dev mailing list