A suggestion for non-ASCII Scoring
    David Relson
    relson at osagesoftware.com
    Fri Jan 23 20:14:58 CET 2004

On Fri, 23 Jan 2004 10:58:06 -0800
Greg McCann wrote:
> On 1/23/2004 at 1:13 PM David Relson <relson at osagesoftware.com> wrote:
> The only situation where I could see this not being helpful is for
> users that receive legitimate email containing a lot of non-ASCII
> characters. In that case, they may want to continue scoring non-ASCII
> words as distinct tokens.
That would be true of most speakers of European languages.  It seems like
all 5 vowels have 3 accents each, and several other letters have accents
as well, and all of those characters have the high bit (0x80) set.
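
To illustrate the high-bit point, here is a minimal Python sketch. It assumes
the mail is encoded as ISO-8859-1 (an assumption; the thread does not name a
charset), and the helper name has_high_bit is made up for this example.

```python
# Illustrative only: accented vowels in ISO-8859-1 all have bit 7 (0x80) set.
# The charset and the helper name are assumptions, not bogofilter code.

def has_high_bit(byte: int) -> bool:
    """Return True when the high bit (0x80) of a byte value is set."""
    return (byte & 0x80) != 0

# Five vowels, three accents each (acute, grave, circumflex).
accented = "áàâéèêíìîóòôúùû"
latin1_bytes = accented.encode("iso-8859-1")

# Every accented vowel encodes to a single byte with the high bit set.
all_high = all(has_high_bit(b) for b in latin1_bytes)

# Plain ASCII vowels never have the high bit set.
plain_high = any(has_high_bit(b) for b in "aeiou".encode("iso-8859-1"))
```
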
> Not to make things too complicated, but users could have maximum
> flexibility in handling non-ASCII messages with a three-level scoring
> option:
> 
> replace_nonascii_characters=no            no non-ASCII substitution
> replace_nonascii_characters=yes           substitute individual
>                                           non-ASCII characters with ?
> replace_nonascii_characters=whole_word    tokenize the whole non-ASCII
>                                           word as ?...
Not a bad way to do it.
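
A rough Python sketch of how the three settings might behave on a single
token. This is not bogofilter's code: the function replace_nonascii and the
constant WHOLE_WORD_PLACEHOLDER are hypothetical, and since the proposal does
not spell out what "?..." expands to, the sketch assumes one fixed
placeholder token.

```python
# Hypothetical sketch of the proposed three-level option; not bogofilter code.

VALID_MODES = ("no", "yes", "whole_word")

# Assumption: every non-ASCII word collapses to this one fixed token.
WHOLE_WORD_PLACEHOLDER = b"?..."


def replace_nonascii(token: bytes, mode: str) -> bytes:
    """Apply one of the three proposed settings to a single token."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    # "no", or a pure-ASCII token: leave the token untouched.
    if mode == "no" or not any(b & 0x80 for b in token):
        return token

    # "yes": replace each high-bit byte with '?' (0x3F), keep ASCII bytes.
    if mode == "yes":
        return bytes(0x3F if b & 0x80 else b for b in token)

    # "whole_word": the entire token becomes the shared placeholder.
    return WHOLE_WORD_PLACEHOLDER


word = "café".encode("iso-8859-1")  # b"caf\xe9"
as_is = replace_nonascii(word, "no")
per_char = replace_nonascii(word, "yes")
whole = replace_nonascii(word, "whole_word")
```

With "yes", words that differ only in their accented letters still produce
distinct tokens such as caf?, while "whole_word" scores every non-ASCII word
as the same token.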
    
    
More information about the bogofilter
mailing list