... convert_unicode.c ...

Matthias Andree matthias.andree at gmx.de
Sat Jun 25 17:11:20 CEST 2005


"Pavel Kankovsky" <peak at argo.troja.mff.cuni.cz> writes:

> On Fri, 24 Jun 2005, Matthias Andree wrote:
>
>> As far as I can tell from the Wikipedia information about Chinese and
>> Taiwanese scripts, it is uncommon to separate words with blanks in these
>> languages, separation is rather used for splitting up paragraphs or
>> ideas.
>
> AFAIK, the basic unit of hanzi, the Chinese script, is a monosyllabic
> word corresponding to a single hanzi character. More complex concepts are
> expressed as multisyllabic phrases (e.g. schizophrenia is expressed as
> something like "split personality disease"), these phrases are often
> considered to be long words.

Side note: no matter how you count, whether you count all historic
scripts to 80,000 written words, or just a few thousand in daily use,
there are sure to be lots of homophones. Not that bogofilter cares about
pronunciation. Homographs may be a problem, but I'd suggest that we
don't care for now. The interesting thing is that around a billion
people should be able to *read* Chinese, even though it's not certain
that any two of them understand each other when they talk.

If a graph is a word is a syllable, we'll be fine with breaking text in
these character sets up at character boundaries, or with emitting pairs
of words if the information content of such a word is low without
context. That is likely to be post-1.0 stuff though.
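
To make that concrete, here is a rough sketch of both variants, in Python
for brevity rather than bogofilter's C. It is not bogofilter code: the
names is_cjk and cjk_tokens are made up, and treating only the CJK Unified
Ideographs block (U+4E00..U+9FFF) as "Chinese" is a simplifying assumption.

```python
def is_cjk(ch):
    """Assumption: only the CJK Unified Ideographs block counts as CJK."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_tokens(text, pairs=False):
    """Tokenize runs of CJK characters; everything else only separates runs.

    pairs=False: one token per character.
    pairs=True:  overlapping two-character tokens within each run
                 (a run of length one is emitted as a single character).
    """
    tokens = []
    run = []

    def flush():
        # Emit the tokens for the run collected so far, then reset it.
        if not run:
            return
        if pairs and len(run) > 1:
            tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        else:
            tokens.extend(run)
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            flush()
    flush()
    return tokens


print(cjk_tokens("中文信息"))              # ['中', '文', '信', '息']
print(cjk_tokens("中文信息", pairs=True))  # ['中文', '文信', '信息']
```

The pairs overlap so that no two-character word is lost just because it
starts at an odd offset; the price is roughly as many tokens as characters.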

> The Taiwanese script is in fact *the* Chinese script because people of
> Taiwan decided to stick to the traditional hanzi. It is known as
> "traditional Chinese" and BIG5 is the most common encoding for this 
> script.

I can imagine traditional Chinese is A LOT faster when hand-written.
Whether it's read more easily, I don't know. There are sure to be
difficulties
a) of older people (born in the 1940s or before) reading newer texts,
b) of younger people reading texts from before 1956 or whenever China
introduced the simplification.

In the German-speaking countries, a spelling reform becomes effective
and mandatory for schools and official use later this summer, after a
seven-year transitional period.

While it has some easy to grasp and reasonable ideas of getting rid of
special cases (say, Schiffahrt (navigation, shipping) is now spelled
Schifffahrt, being composed of Schiff (ship, boat) and Fahrt (trip,
journey, ride)), it has some controversial rules WRT writing composites
as one word or separately. There will be parallel spellings for a
while, Fluss (new) and Fluß (old) and similar, Stengel <-> Stängel (stem
of a plant).

Sometimes, the minuscule spelling difference conveys meaning though, say

auseinandersetzen = deal with, look into

auseinander setzen = relocate (a pupil to a different desk if
                     (s)he's chatting too much with his/her neighbour)

> One final observation: it appears to be quite common feature of Asian HTML
> spam not to provide the right charset in HTML <meta> tag rather in MIME 
> headers i.e. Content-Type (in fact, I can find a few samples where there 
> is an explicit (!) bogus charset in Content-Type, e.g. US-ASCII).

This appears a bit contradictory. Are you saying that we should look at
the HTML META tag if present, and not at the Content-Type?
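
If that is what you mean, the policy would look roughly like the sketch
below (Python for brevity, not bogofilter code). The names META_CHARSET
and pick_charset are made up for illustration, the regular expression is
a crude approximation rather than a real HTML parser, and whether META
should win over Content-Type at all is exactly the open question.

```python
import re

# Crude match for a charset given inside an HTML <meta ...> tag, e.g.
#   <meta http-equiv="Content-Type" content="text/html; charset=big5">
#   <meta charset="big5">
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def pick_charset(content_type_charset, html_body):
    """Return a lowercased charset name, or None if nothing is declared.

    content_type_charset: charset parameter from the MIME Content-Type
                          header as a str, or None if absent.
    html_body:            raw (undecoded) HTML body as bytes.

    Hypothetical policy: a charset declared in the HTML META tag takes
    precedence over the one in the MIME header.
    """
    match = META_CHARSET.search(html_body)
    if match:
        return match.group(1).decode("ascii").lower()
    if content_type_charset:
        return content_type_charset.lower()
    return None


body = b'<html><head><meta http-equiv="Content-Type" ' \
       b'content="text/html; charset=big5"></head></html>'
print(pick_charset("US-ASCII", body))               # big5
print(pick_charset("US-ASCII", b"<html></html>"))   # us-ascii
```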

-- 
Matthias Andree


