... convert_unicode.c ...

David Relson relson at osagesoftware.com
Sat Jun 25 00:40:12 CEST 2005


On Fri, 24 Jun 2005 14:22:28 +0200
Matthias Andree wrote:

> On Fri, 24 Jun 2005, David Relson wrote:
> 
> > The parsing of Korean and
> > Chinese may not produce tokens that are meaningful to someone who
> > speaks the languages.  However, the tokens work well for bogofilter's
> > classification, and that's what we need.
> 
> As far as I can tell from the Wikipedia information about Chinese and
> Taiwanese scripts, it is uncommon to separate words with blanks in these
> languages; separation is instead used for splitting up paragraphs or
> ideas. An additional difficulty is that, depending on the region, some
> text runs right-to-left (names in Taiwan), and some is also written
> column-wise (top-to-bottom within a column, with columns running right
> to left), although left-to-right is, according to Wikipedia, gaining
> ground, particularly where Romanizations or Latin words are used.
> 
> > We'll deal with the issue when someone
> > appears who knows about the problem domain.  
> 
> Indeed, and it may well be that we'd need an external library with some
> understanding of Chinese, Japanese, or Korean scripts, insofar as it can
> "see" common prefixes or suffixes; I don't know whether it's reasonable
> for software to determine word boundaries without some artificial
> intelligence. We might get away with just bundling runs of graphs
> (of perhaps two or three) and scoring these.

Matthias,

My plan for Chinese is:  change nothing; continue doing what we're
doing.  Remember that our parsing goal is to create tokens that can be
used for scoring.  With our present parsing we have tokens that are not
words, for example "$1.00".  These tokens work perfectly well for our
purpose, which is classifying messages.  Similarly, as long as our
parsing of Chinese gives tokens that are usable for scoring, we're fine.
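
To put it another way: to the classifier a token is just a string with
counts attached.  The toy sketch below (made-up counts, not bogofilter's
wordlist or scoring code) shows how a non-word token like "$1.00" still
yields a usable score:

/* Toy illustration only -- not bogofilter's wordlist or scoring code.
 * A token such as "$1.00" is just a key with two counts attached;
 * the classifier never needs to know whether it is a "word". */
#include <stdio.h>

struct token_counts {
    const char *token;   /* the token text, e.g. "$1.00" */
    unsigned    spam;    /* occurrences in spam messages */
    unsigned    ham;     /* occurrences in ham messages  */
};

/* crude per-token "spamicity": fraction of sightings that were spam */
static double spamicity(const struct token_counts *t)
{
    unsigned total = t->spam + t->ham;
    return total ? (double)t->spam / total : 0.5;
}

int main(void)
{
    struct token_counts t = { "$1.00", 37, 5 };   /* made-up counts */
    printf("%s -> %.3f\n", t.token, spamicity(&t));
    return 0;
}

The same would hold for a run of Chinese characters: a distinctive byte
string that recurs in spam scores like any other token.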

The fact that Chinese can be written vertically or right to left isn't
relevant.  That's a rendering issue.  An email system sees a sequence
of bytes, and rendering (in the appropriate direction) converts those
bytes into a readable message.  As long as our parsing supports
classification, we're good.
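
To make that concrete: the parser only ever walks the stored byte
sequence.  This throwaway example (arbitrary sample text, not bogofilter
code) dumps the bytes a parser would see; they are identical no matter
which direction a reader's client eventually draws them in:

/* Illustration only: the raw UTF-8 bytes are the same regardless of
 * how a mail client later renders the text (left-to-right,
 * right-to-left, or in columns). */
#include <stdio.h>

int main(void)
{
    /* "中文 test" in UTF-8 -- arbitrary sample text */
    const unsigned char *p =
        (const unsigned char *)"\xe4\xb8\xad\xe6\x96\x87 test";

    for (; *p != '\0'; p++)
        printf("%02x ", *p);   /* dump the bytes in storage order */
    putchar('\n');
    return 0;
}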

If it ever becomes necessary to parse the bytes into words, the reason
will be that someone knowledgeable wants it done.  When that person
appears, he/she can contribute the code.  It's also possible that
something other than bogofilter will be the proper tool for working with
Asian languages.  Either way (contributed code or a different tool) works
for me.  My ambitions don't include processing a language I have no
expectation of ever using.
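
For the record, if someone ever does want to experiment with your idea
of bundling runs of two or three characters, a rough sketch might look
like the one below.  The helper names and the fixed run length are made
up for illustration; none of this is bogofilter code:

#include <stdio.h>
#include <string.h>

/* length in bytes of the UTF-8 character starting at *s */
static size_t utf8_charlen(const unsigned char *s)
{
    if (*s < 0x80)           return 1;  /* ASCII              */
    if ((*s & 0xE0) == 0xC0) return 2;  /* 2-byte sequence    */
    if ((*s & 0xF0) == 0xE0) return 3;  /* 3-byte sequence    */
    if ((*s & 0xF8) == 0xF0) return 4;  /* 4-byte sequence    */
    return 1;                           /* invalid byte: skip */
}

#define NGRAM 2   /* bundle runs of two characters */

/* print every overlapping run of NGRAM characters, one per line */
static void emit_ngrams(const unsigned char *s)
{
    size_t len = strlen((const char *)s);
    size_t start[NGRAM + 1];  /* byte offsets of NGRAM+1 consecutive chars */
    size_t n = 0, i = 0;

    while (i < len) {
        start[n++] = i;
        i += utf8_charlen(s + i);
        if (n == NGRAM + 1) {
            printf("%.*s\n", (int)(start[NGRAM] - start[0]),
                   (const char *)(s + start[0]));
            memmove(start, start + 1, NGRAM * sizeof start[0]);
            n = NGRAM;    /* slide the window by one character */
        }
    }
    if (n == NGRAM)       /* final run ends at the end of the string */
        printf("%.*s\n", (int)(len - start[0]), (const char *)(s + start[0]));
}

int main(void)
{
    /* "日本語" in UTF-8 -- prints the runs "日本" and "本語" */
    emit_ngrams((const unsigned char *)"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e");
    return 0;
}

Overlapping runs like these could then be counted and scored exactly
like any other token, which is probably the cheapest way to get useful
features without real word segmentation.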

Regards,

David
