second lexer problem
Matthias Andree
matthias.andree at gmx.de
Sun Nov 24 23:23:32 CET 2002
On Sun, 24 Nov 2002, David Relson wrote:
> Given that token "accounts" had a spam count of 1 on linux and 2 on hp-ux,
> I suspected another lexer anomaly. So I ran the two training sets through
> bogolexer and grepped for "account" in the output. Here's what I see:
>
> [relson@osage tests]$ cat t.systest.d/inputs/*mbx | ../bogolexer -p | sort | uniq -c | grep account
> 4 account
> 1 account"
> 6 accounts
>
> The odd character is 0x94. As I've also seen 0x93, I'm adding code to
> change them both to spaces.
Well, convert 0x92..0x94 to ' " " (in that order) if that matters. These
are the Windows-1252 typographical right single quote, followed by the
typographical left and right double quotes.
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
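
A minimal sketch of that mapping in C, assuming the lexer input is
available as a byte buffer that can be rewritten in place. The function
name map_cp1252_quotes is made up for illustration; this is not
bogofilter's actual code.

```c
#include <stddef.h>

/* Hypothetical helper (not part of bogofilter): replace the Windows-1252
 * typographical quotes 0x92, 0x93 and 0x94 with their ASCII equivalents
 * in place. All other bytes are left untouched. */
void map_cp1252_quotes(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        switch (buf[i]) {
        case 0x92:              /* right single quote -> ' */
            buf[i] = '\'';
            break;
        case 0x93:              /* left double quote  -> " */
        case 0x94:              /* right double quote -> " */
            buf[i] = '"';
            break;
        default:
            break;
        }
    }
}
```

Mapping to ASCII quotes rather than spaces keeps the bytes as
punctuation, so how a trailing quote is handled is still up to the
lexer rules.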
--
Matthias Andree