A suggestion for non-ASCII Scoring

Fri Jan 23 20:14:58 CET 2004

On Fri, 23 Jan 2004 10:58:06 -0800
Greg McCann wrote:

> On 1/23/2004 at 1:13 PM David Relson <relson at osagesoftware.com> wrote:

> The only situation where I could see this not being helpful is for
> users that receive legitimate email containing a lot of non-ASCII
> characters. In that case, they may want to continue scoring non-ASCII
> words as distinct tokens.

That would be true of most speakers of European languages.  It seems
like each of the 5 vowels takes about 3 accents, and several other
letters take them as well -- and all of those characters have the high
bit (0x80) set.
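
To make the high-bit point concrete, here is a small sketch. It assumes
the mail is encoded as ISO-8859-1 (Latin-1), which the thread does not
state; the helper name has_high_bit and the sample string are made up
for illustration.

```python
def has_high_bit(byte_value: int) -> bool:
    """Return True when the byte has bit 0x80 set, i.e. it is not 7-bit ASCII."""
    return (byte_value & 0x80) != 0


# Five vowels with acute, grave and circumflex accents (illustrative sample).
accented = "áàâéèêíìîóòôúùû"

# Assumption: Latin-1 encoding, where each of these characters is one byte.
latin1_bytes = accented.encode("latin-1")

# Every accented vowel lands in the 0x80-0xFF range.
all_high = all(has_high_bit(b) for b in latin1_bytes)

# The unaccented vowels stay below 0x80.
plain_high = any(has_high_bit(b) for b in "aeiou".encode("latin-1"))
```

So under that encoding, any word containing one of these letters counts
as non-ASCII, even in otherwise ordinary text.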

> Not to make things too complicated, but users could have maximum
> flexibility in handling non-ASCII messages with a three-level scoring
> option:
> 
> replace_nonascii_characters=no            no non-ASCII substitution
> replace_nonascii_characters=yes           substitute individual
>                                           non-ASCII characters with ?
> replace_nonascii_characters=whole_word    tokenize the whole non-ASCII
>                                           word as ?...

Not a bad way to do it.
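
For illustration only, the three proposed settings might behave roughly
like the sketch below. This is not bogofilter code: the function
replace_nonascii, the whitespace tokenizer, and the WHOLE_WORD_TOKEN
placeholder are all assumptions, and it works on decoded text rather
than raw bytes.

```python
import re
from typing import List

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

# Assumption: the proposal's "?..." is read here as one literal placeholder token.
WHOLE_WORD_TOKEN = "?..."


def replace_nonascii(text: str, mode: str = "no") -> List[str]:
    """Split text on whitespace and apply one of the three proposed modes."""
    tokens = text.split()
    if mode == "no":
        # Leave non-ASCII words alone so they score as distinct tokens.
        return tokens
    if mode == "yes":
        # Replace each non-ASCII character with '?', keeping the ASCII ones.
        return [NON_ASCII.sub("?", tok) for tok in tokens]
    if mode == "whole_word":
        # Collapse any word containing a non-ASCII character into one token.
        return [WHOLE_WORD_TOKEN if NON_ASCII.search(tok) else tok
                for tok in tokens]
    raise ValueError("unknown mode: " + mode)
```

With the input "café au lait", mode "no" keeps "café" as it is, "yes"
turns it into "caf?", and "whole_word" replaces it with the placeholder,
while "au" and "lait" pass through unchanged in all three modes.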