second lexer problem
David Relson
relson at osagesoftware.com
Sun Nov 24 20:56:43 CET 2002
Allyn,
I've identified the cause and have a fix.
Given that the token "accounts" had a spam count of 1 on Linux and 2 on HP-UX,
I suspected another lexer anomaly. So I ran the two training sets through
bogolexer and grepped for "account" in the output. Here's what I see:
[relson at osage tests]$ cat t.systest.d/inputs/*mbx | ../bogolexer -p | sort | uniq -c | grep account
4 account
1 account"
6 accounts
The odd character is 0x94. As I've also seen 0x93, I'm adding code to
change them both to spaces.
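
For illustration only, here is a minimal Python sketch of the kind of substitution described above. It is not the actual bogofilter change (bogofilter is written in C), and the helper name blank_cp1252_quotes is hypothetical. In the Windows-1252 code page, 0x93 and 0x94 are the left and right curly double quotes, which is why one can end up glued to a token such as "account".

```python
# Hypothetical helper, not bogofilter's actual code: a sketch of mapping
# the bytes 0x93 and 0x94 to spaces before the text reaches a tokenizer.
# In Windows-1252 these two bytes are the left and right curly double quotes.
_QUOTE_BYTES = bytes([0x93, 0x94])
_TABLE = bytes.maketrans(_QUOTE_BYTES, b"  ")


def blank_cp1252_quotes(raw: bytes) -> bytes:
    """Return raw with every 0x93 and 0x94 byte replaced by an ASCII space."""
    return raw.translate(_TABLE)


if __name__ == "__main__":
    sample = b"your \x93account\x94 has expired"
    cleaned = blank_cp1252_quotes(sample)
    # After the substitution, whitespace splitting yields a clean "account".
    print(cleaned.split())
```

With the two bytes blanked out, a token such as account followed by 0x94 should be counted together with plain "account" instead of as a separate entry.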
David
--------------------------------------------------------
David Relson Osage Software Systems, Inc.
relson at osagesoftware.com Ann Arbor, MI 48103
www.osagesoftware.com tel: 734.821.8800