... convert_unicode.c ...

Matthias Andree matthias.andree at gmx.de
Fri Jun 24 14:22:28 CEST 2005


On Fri, 24 Jun 2005, David Relson wrote:

> The parsing of Korean and
> Chinese may not produce tokens that are meaningful to someone who
> speaks the languages.  However, the tokens work well for bogofilter's
> classification, and that's what we need.

As far as I can tell from the Wikipedia information about Chinese and
Taiwanese scripts, it is uncommon to separate words with blanks in these
languages; separation is used rather for splitting up paragraphs or
ideas. An additional difficulty is that, depending on the region, some
text is written right-to-left (names in Taiwan, for instance), and some
is written column-wise (top-to-bottom within a column, with columns
running from right to left), although left-to-right is, according to
Wikipedia, gaining ground, particularly where Romanizations or Latin
words appear.

> We'll deal with the issue when someone
> appears who knows about the problem domain.  

Indeed, and it may well be that we'd need an external library that has
some understanding of Chinese, Japanese, or Korean scripts, insofar as
it can "see" common prefixes or suffixes; I do not know whether it is
reasonable for software to determine word boundaries without some
artificial intelligence. We might get away with just bundling runs of
graphs (perhaps two or three graphs at a time) and scoring these, along
the lines of the sketch below.
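
For what it's worth, here is a minimal sketch in C of what I mean by
bundling runs of graphs: it decodes UTF-8, treats the CJK Unified
Ideographs block (U+4E00..U+9FFF) as "graphs", and emits each adjacent
pair within a run as one token. The names (emit_token,
tokenize_cjk_bigrams) are made up for illustration, not bogofilter's
lexer API, and a real version would also have to cover the other CJK
blocks and 4-byte UTF-8 sequences.

#include <stdio.h>

/* Decode one UTF-8 sequence into *cp; return its length in bytes
 * (0 on a malformed sequence). 4-byte sequences omitted for brevity. */
static int utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
        && (s[2] & 0xC0) == 0x80) {
        *cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6)
            | (s[2] & 0x3F);
        return 3;
    }
    return 0;
}

/* Rough test for "graphs": the CJK Unified Ideographs block only. */
static int is_cjk(unsigned long cp)
{
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

/* Stand-in for handing a token to the classifier. */
static void emit_token(const char *tok, int len)
{
    printf("token: %.*s\n", len, tok);
}

/* Walk the string; within a run of CJK graphs, emit every adjacent
 * pair as one token, so a run of n graphs yields n-1 tokens. */
static void tokenize_cjk_bigrams(const char *text)
{
    const unsigned char *s = (const unsigned char *)text;
    const unsigned char *prev = NULL;   /* previous graph in the run */
    int prev_len = 0;

    while (*s) {
        unsigned long cp;
        int len = utf8_decode(s, &cp);
        if (len == 0) {                 /* skip a malformed byte */
            s++;
            prev = NULL;
            continue;
        }
        if (is_cjk(cp)) {
            if (prev)                   /* pair = previous + current */
                emit_token((const char *)prev, prev_len + len);
            prev = s;
            prev_len = len;
        } else {
            prev = NULL;                /* run ended, start over */
        }
        s += len;
    }
}

int main(void)
{
    /* UTF-8 bytes for the four graphs 你好世界, then Latin text */
    tokenize_cjk_bigrams("\xe4\xbd\xa0\xe5\xa5\xbd"
                         "\xe4\xb8\x96\xe7\x95\x8c plus Latin text");
    return 0;
}

Compiled and run, this prints the tokens 你好, 好世 and 世界 from the
sample string and leaves the Latin text for the normal lexer; whether
pairs or triples of graphs classify better is something we would have
to measure.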

-- 
Matthias Andree


