... convert_unicode.c ...

David Relson relson at osagesoftware.com
Sat Jun 25 16:34:08 CEST 2005


On Sat, 25 Jun 2005 15:39:25 +0200 (CEST)
Pavel Kankovsky wrote:

...[snip]...

> You are right: the use of blanks to separate words is rather uncommon.
> They use punctuation imported from Western scripts (dots, commas,
> parentheses, etc.) to separate sentences or parts of sentences.

Bogofilter's parser roughly considers a token to be a sequence of
letters and digits.  A few special characters are allowed: for
example, a dollar sign can appear in a money amount, and periods can
appear within a token.  Most punctuation characters serve as token
separators.

For text in charsets like iso-8859-x these rules work quite well.
Exactly how well the lexer's use of punctuation, letters, digits,
etc. will carry over to messages converted to Unicode isn't certain.
The resulting tokens may not be proper words (in the linguistic
sense); however, what we need are tokens that can be used for message
classification, not dictionary words.

> One possible approach might be to treat every single hanzi/kanji character
> as a token.
> 
> 
> I know very little about the Korean script but it appears to be based on 
> similar principles.

My understanding is that Korean is a syllabary: each character
corresponds to one syllable, and words are built from syllables.
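For what it's worth, your per-character idea looks easy to bolt onto
the lexer once conversion is in place.  A minimal sketch, assuming we
have code points in hand after conversion; the ranges below are my
assumptions and cover only the most common blocks, so a real
implementation would consult the Unicode character database:

#include <stdint.h>

/* True for the main hanzi/kanji blocks. */
static int is_cjk_ideograph(uint32_t cp)
{
    return (cp >= 0x4E00 && cp <= 0x9FFF) ||   /* CJK Unified Ideographs */
           (cp >= 0x3400 && cp <= 0x4DBF);     /* Extension A */
}

/* True for precomposed Hangul syllables. */
static int is_hangul_syllable(uint32_t cp)
{
    return cp >= 0xAC00 && cp <= 0xD7A3;
}

/* In the lexer, a code point in either class would be emitted
 * immediately as a one-character token instead of being accumulated
 * into a run of "letters". */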

> > The additional difficulty is that depending on region, some scripts are
> > right-to-left (names in Taiwan), and some are also column-wise
> > (top-to-bottom within the column, with columns from right to left)
> > although left-to-right is - according to Wikipedia - gaining ground and
> > used particularly when Romanizations or Latin words are used.
> 
> I don't think this is a problem because the characters of hanzi/kanji are
> AFAIK always encoded in their logical order.

I agree.  Encoding characters in any order other than the logical one
makes no sense!

David



