... convert_unicode.c ...
David Relson
relson at osagesoftware.com
Sat Jun 25 04:22:47 CEST 2005
On Sat, 25 Jun 2005 02:18:39 +0200
Matthias Andree wrote:
> David Relson <relson at osagesoftware.com> writes:
>
> > My plan for Chinese is: change nothing; continue doing what we're
> > doing. Remember our parsing goal is to create tokens that can be used
> > for scoring. With our present parsing we have tokens that are not
> > words, for example "$1.00". These tokens work perfectly well for our
> > purposes, which is classifying messages. Similarly as long as our
> > parsing of Chinese gives tokens that are usable for scoring, we're
> > fine.
>
> I don't know how much entropy such a Chinese "token" carries in itself
> and how much needs to be derived from context. It appears that context
> matters a lot.
>
> > The fact that Chinese can be written vertically or right to left isn't
> > relevant. That's a rendering issue.
>
> It isn't if tokens are written left-to-right in one text and then in the
> reverse direction.
>
> --
> Matthias Andree

My thought is that the order of characters in a message reflects the
reading order. There's a first character, then a second, etc. Whether
they're displayed left to right, top to bottom, or right to left is a
separate aspect of the message. Of course, I've no expertise in this
area and am just applying common sense to the issue.
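
To make the point concrete, here is a minimal sketch in C. It is not taken from bogofilter or convert_unicode.c: the function name decode_in_stored_order is hypothetical, and it assumes the message body has already been converted to UTF-8. It walks the bytes front to back and yields code points in the order they are stored, with no reference to display direction. Validation is deliberately simplified (overlong forms and surrogates are not rejected).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; not bogofilter code.
 * Decodes a NUL-terminated UTF-8 string into code points, in the
 * order the bytes are stored. Writes at most max code points to out
 * and returns how many were written. Malformed input yields U+FFFD. */
size_t decode_in_stored_order(const char *s, uint32_t *out, size_t max)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    assert(s != NULL && out != NULL);

    while (*p != '\0' && n < max) {
        uint32_t cp;
        int extra;  /* continuation bytes still expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else                          { cp = 0xFFFD;    extra = 0; } /* stray byte */
        p++;

        /* Fold in continuation bytes (10xxxxxx), six bits at a time. */
        while (extra > 0 && (*p & 0xC0) == 0x80) {
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
            extra--;
        }
        if (extra > 0)
            cp = 0xFFFD;  /* sequence ended early */

        out[n++] = cp;    /* first stored character is first out */
    }
    return n;
}
```

Whatever direction a mail client chooses to draw the text in, a tokenizer built on a loop like this sees the same sequence: a first character, then a second, and so on.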