second lexer problem

David Relson relson at osagesoftware.com
Sun Nov 24 20:56:43 CET 2002


Allyn,

I've identified the cause and have a fix.

Given that the token "accounts" had a spam count of 1 on linux and 2 on hp-ux,
I suspected another lexer anomaly.  So I ran the two training sets through
bogolexer and grepped for "account" in the output.  Here's what I see:

[relson@osage tests]$ cat t.systest.d/inputs/*mbx | ../bogolexer -p | sort | uniq -c | grep account
       4	account
       1	account"
       6	accounts

The odd character is 0x94.  As I've also seen 0x93, I'm adding code to 
change them both to spaces.
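
A minimal sketch of that kind of substitution, for illustration only. The
function name blank_cp1252_quotes and its buffer/length interface are
hypothetical and are not taken from the bogofilter source. In Windows-1252,
0x93 and 0x94 are the left and right curly double quotes, which would explain
the stray quote glued to "account" above.

```c
#include <stddef.h>

/* Hypothetical helper, not bogofilter's actual code: overwrite the bytes
 * 0x93 and 0x94 with spaces, in place, so that the lexer sees them as
 * token separators rather than as part of a word. */
static void blank_cp1252_quotes(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x93 || buf[i] == 0x94)
            buf[i] = ' ';
    }
}
```

Run over the input buffer before tokenizing, this should make "account"
followed by 0x94 lex as plain "account" regardless of platform.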

David
--------------------------------------------------------
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800




