second lexer problem

Matthias Andree matthias.andree at gmx.de
Sun Nov 24 23:23:32 CET 2002


On Sun, 24 Nov 2002, David Relson wrote:

> Given that token "accounts" had a spam count of 1 on linux and 2 on hp-ux, 
> I suspected another lexer anomaly.  So I ran the two training sets through 
> bogolexer and grepped for "account" in the output.  Here's what I see:
> 
> [relson at osage tests]$  cat t.systest.d/inputs/*mbx | ../bogolexer -p | sort | uniq -c | grep account
>       4	account
>       1	account"
>       6	accounts
> 
> The odd character is 0x94.  As I've also seen 0x93, I'm adding code to 
> change them both to spaces.

Well, convert 0x92..0x94 to ' " " (in that order) if that matters. In
Windows-1252 these are the typographical right single quote (0x92) and
the typographical left and right double quotes (0x93 and 0x94).
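
A minimal sketch of that mapping in C, assuming the input is handled as a
byte buffer before it reaches the tokenizer. The function name
map_cp1252_quotes is hypothetical and is not taken from the bogofilter
sources:

```c
#include <stddef.h>

/* Hypothetical helper (not bogofilter's actual code): replace the
 * Windows-1252 typographic quotes 0x92..0x94 with their ASCII
 * counterparts, in place. All other bytes are left untouched. */
static void map_cp1252_quotes(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        switch (buf[i]) {
        case 0x92:              /* right single quote */
            buf[i] = '\'';
            break;
        case 0x93:              /* left double quote */
        case 0x94:              /* right double quote */
            buf[i] = '"';
            break;
        default:
            break;
        }
    }
}
```

Whether to map to ASCII quotes or to spaces depends on how the lexer
treats ' and " inside tokens; if both act as delimiters, the result for
"account" followed by 0x94 should be the same either way.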

http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

-- 
Matthias Andree



More information about the bogofilter-dev mailing list