second lexer problem
relson at osagesoftware.com
Sun Nov 24 15:17:03 EST 2002
At 03:05 PM 11/24/02, Allyn Fratkin wrote:
>David Relson wrote:
>>The odd character is 0x94. As I've also seen 0x93, I'm adding code to
>>change them both to spaces.
>Then I'd be suspicious of anything in the range 0x90-0xA0 at least.
>But what will these changes do to our "support" of other languages?
Yep. My linux box shows 0x92 as apostrophe and 0xA0 as space (both are
reasonable interpretations/translations). The (new) training set and test
messages also contain 0x93 and 0x94 (brackets or quotes that look like
chevrons), and 0xA9 (copyright sign). As a short-term solution, I'm going
to use the reasonable interpretations and translate the others to
spaces. If this is a problem, we'll have to move forward to more general
character set handling - a project I've started but am not ready to release.
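
To make the short-term approach concrete, here is a minimal sketch of the byte translation described above. It is written in Python for illustration only and is not the code going into the lexer; the names TRANSLATIONS, TO_SPACE, build_table and normalize are hypothetical, and the mapping covers only the bytes mentioned in this thread.

```python
# Illustrative sketch only: a 256-entry translation table covering the
# bytes discussed in this thread. All names here are hypothetical.

# Bytes with a "reasonable interpretation" in plain ASCII.
TRANSLATIONS = {
    0x92: ord("'"),   # displayed as an apostrophe
    0xA0: ord(" "),   # displayed as a space
}

# Bytes with no obvious ASCII equivalent; these become spaces.
TO_SPACE = (0x93, 0x94, 0xA9)


def build_table() -> bytes:
    """Return an identity table with the overrides above applied."""
    table = bytearray(range(256))
    for src, dst in TRANSLATIONS.items():
        table[src] = dst
    for src in TO_SPACE:
        table[src] = ord(" ")
    return bytes(table)


TABLE = build_table()


def normalize(raw: bytes) -> bytes:
    """Translate the odd bytes before the text reaches the lexer."""
    return raw.translate(TABLE)
```

Every byte not listed passes through unchanged, so extending the treatment to the rest of the 0x90-0xA0 range would only mean adding entries to the two collections.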