... convert_unicode.c ...
Matthias Andree
matthias.andree at gmx.de
Sat Jun 25 02:18:39 CEST 2005
David Relson <relson at osagesoftware.com> writes:
> My plan for Chinese is: change nothing; continue doing what we're
> doing. Remember our parsing goal is to create tokens that can be used
> for scoring. With our present parsing we have tokens that are not
> words, for example "$1.00". These tokens work perfectly well for our
> purpose, which is classifying messages. Similarly, as long as our
> parsing of Chinese gives tokens that are usable for scoring, we're
> fine.
I don't know how much entropy such a Chinese "token" carries in itself
and how much needs to be derived from context. It appears that context
matters a lot.
> The fact that Chinese can be written vertically or right to left isn't
> relevant. That's a rendering issue.
It isn't merely a rendering issue if tokens are written left-to-right in
one text and in the reverse direction in another.
--
Matthias Andree