second lexer problem

David Relson relson at osagesoftware.com
Sun Nov 24 21:17:03 CET 2002


At 03:05 PM 11/24/02, Allyn Fratkin wrote:

>David Relson wrote:
>
>>The odd character is 0x94.  As I've also seen 0x93, I'm adding code to
>>change them both to spaces.
>
>then i'd be suspect of anything in the range of 0x90-0xa0 at least.
>but what will these changes do to our "support" of other languages?

Yep.  My Linux box shows 0x92 as an apostrophe and 0xA0 as a space (both are 
reasonable interpretations/translations).  The (new) training set and test 
messages also contain 0x93 and 0x94 (brackets or quotes that look like 
chevrons), and 0xA9 (copyright sign).  As a short-term solution, I'm going 
to use the reasonable interpretations where there is one and translate the 
others to spaces.  If this is a problem, we'll have to move forward to more 
general character set handling - a project I've started but am not ready to 
release.
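
A minimal sketch of the short-term mapping described above, written in Python 
for brevity rather than in bogofilter's own C.  The names REASONABLE, TO_SPACE, 
build_table and normalize are hypothetical, and the table covers only the bytes 
named in this message; it is an illustration, not the code that went into the 
lexer.

```python
# Bytes with a "reasonable interpretation", as reported in the message.
# The ASCII replacements are assumptions made for this sketch.
REASONABLE = {
    0x92: ord("'"),  # displayed as an apostrophe
    0xA0: ord(" "),  # displayed as a space
}

# Bytes with no obvious ASCII equivalent: translated to spaces.
TO_SPACE = (0x93, 0x94, 0xA9)


def build_table() -> bytes:
    """Build a 256-entry translation table; unlisted bytes map to themselves."""
    table = bytearray(range(256))
    for byte, replacement in REASONABLE.items():
        table[byte] = replacement
    for byte in TO_SPACE:
        table[byte] = ord(" ")
    return bytes(table)


TABLE = build_table()


def normalize(raw: bytes) -> bytes:
    """Apply the byte-for-byte translation before the text reaches the lexer."""
    return raw.translate(TABLE)
```

Every other high-bit byte passes through unchanged here, so text in other 
8-bit character sets is affected only where it uses the five bytes listed.  
Whether the wider 0x90-0xA0 range should be treated the same way is the open 
question raised in the quoted reply.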

David





