... convert_unicode.c ...

Sat Jun 25 04:22:47 CEST 2005

On Sat, 25 Jun 2005 02:18:39 +0200
Matthias Andree wrote:

> David Relson <relson at osagesoftware.com> writes:
> 
> > My plan for Chinese is:  change nothing; continue doing what we're
> > doing.  Remember our parsing goal is to create tokens that can be used
> > for scoring.  With our present parsing we have tokens that are not
> > words, for example "$1.00".  These tokens work perfectly well for our
> > purposes, which is classifying messages.  Similarly as long as our
> > parsing of Chinese gives tokens that are usable for scoring, we're
> > fine.
> 
> I don't know how much entropy such a Chinese "token" carries in itself
> and how much needs to be derived from context. It appears that context
> matters a lot.
> 
> > The fact that Chinese can be written vertically or right to left isn't
> > relevant.  That's a rendering issue.
> 
> It isn't if tokens are written left-to-right in one text and then in the
> reverse direction.
> 
> -- 
> Matthias Andree

My thought is that the order of characters in a message reflects the
reading order.  There's a first character, then a second, etc.  Whether
they're displayed left to right, top to bottom, or right to left is a
separate aspect of the message.  Of course, I've no expertise in this
area and am just applying common sense to the issue.
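
To make the point above concrete, here is a minimal sketch in C, assuming
the message body has already been converted to UTF-8.  The helper names
utf8_next() and logical_order() are made up for illustration and are not
taken from convert_unicode.c.  The sketch walks the bytes in storage order
and collects code points in that same order; nothing in it knows or cares
how the text will be displayed.  It is simplified and does not reject
overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with n
 * bytes available.  Stores the code point in *cp and returns the number
 * of bytes consumed, or 0 if the input is truncated or malformed. */
static size_t utf8_next(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                     /* 1-byte (ASCII) */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {    /* 2-byte lead */
        *cp = s[0] & 0x1F;
        len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {    /* 3-byte lead */
        *cp = s[0] & 0x0F;
        len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {    /* 4-byte lead */
        *cp = s[0] & 0x07;
        len = 4;
    } else {
        return 0;                          /* stray continuation or bad lead */
    }

    if (len > n)
        return 0;                          /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                      /* not a continuation byte */
        *cp = (*cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical helper: collect up to max code points from text in storage
 * order, which is the order a tokenizer sees.  Returns the number stored,
 * or (size_t)-1 on malformed input. */
static size_t logical_order(const char *text, size_t n,
                            uint32_t *out, size_t max)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t pos = 0, count = 0;

    while (pos < n && count < max) {
        size_t used = utf8_next(s + pos, n - pos, &out[count]);
        if (used == 0)
            return (size_t)-1;
        pos += used;
        count++;
    }
    return count;
}
```

For the six bytes E4 B8 AD E6 96 87 this yields U+4E2D followed by U+6587,
whether a viewer later lays the text out horizontally or vertically.  Note
that the sketch only shows that storage order is independent of rendering.
It cannot tell whether a sender stored the characters in reading order in
the first place, which is the case Matthias raises.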