... convert_unicode.c ...

Sat Jun 25 02:18:39 CEST 2005

David Relson <relson at osagesoftware.com> writes:

> My plan for Chinese is:  change nothing; continue doing what we're
> doing.  Remember our parsing goal is to create tokens that can be used
> for scoring.  With our present parsing we have tokens that are not
> words, for example "$1.00".  These tokens work perfectly well for our
> purposes, which is classifying messages.  Similarly as long as our
> parsing of Chinese gives tokens that are usable for scoring, we're
> fine.

I don't know how much entropy such a Chinese "token" carries by itself
and how much has to be derived from context. It appears that context
matters a lot.

> The fact that Chinese can be written vertically or right to left isn't
> relevant.  That's a rendering issue.

It is relevant if tokens are written left-to-right in one text and in
the reverse direction in another: the same word would then presumably
yield different tokens.
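
To illustrate the direction concern, here is a minimal sketch. It
assumes, purely for illustration, that tokens are adjacent character
pairs; char_bigrams is a hypothetical helper and not what the current
parser does, and the sample string is made up.

```python
def char_bigrams(text):
    """Return the set of adjacent character pairs in text.

    Hypothetical tokenizer for illustration only; the real parser
    may split Chinese text quite differently.
    """
    return {text[i:i + 2] for i in range(len(text) - 1)}


# Made-up sample: four characters meaning roughly "Chinese mail".
forward = "中文邮件"        # characters stored left-to-right
backward = forward[::-1]    # same characters stored in reverse order

forward_tokens = char_bigrams(forward)
backward_tokens = char_bigrams(backward)

# Tokens that a scorer trained on one direction would recognize
# when it sees the other direction.
shared = forward_tokens & backward_tokens

print(sorted(forward_tokens))
print(sorted(backward_tokens))
print(len(shared))
```

Under this assumption the two directions share no tokens at all, so
statistics learned from one would not carry over to the other unless
the text is normalized to a single order before tokenizing.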

-- 
Matthias Andree