A suggestion for non-ASCII Scoring
David Relson
relson at osagesoftware.com
Fri Jan 23 20:14:58 CET 2004
On Fri, 23 Jan 2004 10:58:06 -0800
Greg McCann wrote:
> On 1/23/2004 at 1:13 PM David Relson <relson at osagesoftware.com> wrote:
> The only situation where I could see this not being helpful is for
> users that receive legitimate email containing a lot of non-ASCII
> characters. In that case, they may want to continue scoring non-ASCII
> words as distinct tokens.
That would be true of most speakers of European languages. It seems like
all 5 vowels have 3 accents each, and several other letters have accented
forms as well -- and all of those characters have the high bit (0x80) set.
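
To illustrate the high-bit point, here is a minimal Python sketch. It
assumes the text is in a single-byte encoding such as ISO-8859-1; the
function names has_high_bit and is_non_ascii_word are hypothetical and
are not taken from bogofilter's source.

```python
def has_high_bit(byte: int) -> bool:
    """Return True if the 0x80 bit is set, i.e. the byte is outside 7-bit ASCII."""
    return (byte & 0x80) != 0


def is_non_ascii_word(word: bytes) -> bool:
    """Return True if any byte in the word has the high bit set."""
    return any(has_high_bit(b) for b in word)


# Assumption: ISO-8859-1 input, where the accented "e" in "café" is byte 0xE9.
accented = "café".encode("latin-1")
plain = b"cafe"

print(is_non_ascii_word(accented))
print(is_non_ascii_word(plain))
```

Under that assumption, a single accented letter is enough to mark an
otherwise ordinary word as non-ASCII, which is why a blanket rule would
affect so much legitimate European-language mail.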
> Not to make things too complicated, but users could have maximum
> flexibility in handling non-ASCII messages with a three-level scoring
> option:
>
> replace_nonascii_characters=no          no non-ASCII substitution
> replace_nonascii_characters=yes         substitute individual non-ASCII characters with ?
> replace_nonascii_characters=whole_word  tokenize the whole non-ASCII word as ?...
Not a bad way to do it.
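
As a rough illustration of how the three proposed settings might behave
on a single token, here is a Python sketch. It is not bogofilter code:
the function replace_nonascii and the constant WHOLE_WORD_PLACEHOLDER
are hypothetical, the input is assumed to be single-byte (ISO-8859-1
style) text, and since the thread does not pin down the exact
replacement for whole_word mode, a single "?" is used as a stand-in.

```python
# Assumption: the thread writes the whole-word replacement as "?..." without
# specifying it exactly; a lone "?" is used here purely as a stand-in.
WHOLE_WORD_PLACEHOLDER = b"?"


def replace_nonascii(token: bytes, mode: str) -> bytes:
    """Apply one of the three proposed replace_nonascii_characters modes."""
    if mode == "no":
        # Leave the token untouched; non-ASCII words stay distinct tokens.
        return token
    if mode == "yes":
        # Replace each high-bit byte with "?" (0x3F), keeping ASCII bytes.
        return bytes(0x3F if b & 0x80 else b for b in token)
    if mode == "whole_word":
        # Collapse any token containing a high-bit byte into one placeholder.
        if any(b & 0x80 for b in token):
            return WHOLE_WORD_PLACEHOLDER
        return token
    raise ValueError(f"unknown mode: {mode!r}")


token = "naïve".encode("latin-1")
for mode in ("no", "yes", "whole_word"):
    print(mode, replace_nonascii(token, mode))
```

With mode "yes" the ASCII letters around the accented character
survive, so words that differ only in their accented letters can still
collapse to the same token; with "whole_word" every non-ASCII word maps
to one shared token.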