A suggestion for non-ASCII Scoring

Fri Jan 23 20:26:55 CET 2004

On Fri, 23 Jan 2004, David Relson wrote:

> > The only situation where I could see this not being helpful is for
> > users that receive legitimate email containing a lot of non-ASCII
> > characters. In that case, they may want to continue scoring non-ASCII
> > words as distinct tokens.
> 
> That would be true of most speakers of European languages.  Seems like
> all 5 vowels have 3 accents each and several letters have them as well
> -- and all those characters have the high bit (0x80) on.

Not to mention Cyrillic/Greek, where all the characters will have it.
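
To make the high-bit point concrete, here is a minimal sketch (illustrative only, not bogofilter code). It assumes the mail arrives in a single-byte charset such as ISO-8859-1, KOI8-R or ISO-8859-7. The helper name high_bit_fraction and the sample words are hypothetical, chosen just for the demonstration.

```python
def high_bit_fraction(text: str, encoding: str) -> float:
    """Return the fraction of encoded bytes that have the high bit (0x80) set."""
    data = text.encode(encoding)
    if not data:
        return 0.0
    return sum(1 for b in data if b & 0x80) / len(data)


# Sample words and charsets are assumptions made for illustration.
samples = {
    "english": ("viagra", "ascii"),
    "french": ("café", "iso8859_1"),
    "russian": ("привет", "koi8_r"),
    "greek": ("καλημέρα", "iso8859_7"),
}

for name, (word, charset) in samples.items():
    print(name, charset, high_bit_fraction(word, charset))
```

Under these assumptions, a Western European word has only its accented letters flagged, while a Cyrillic or Greek word is flagged on every byte, so a blanket rule keyed on 0x80 would hit those languages hardest.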

Generally, though, it does seem like one should be 'less caring' about
the particular details of how this or that word is mutilated, simply
because there are way too many ways to mutilate it. Theoretically, yes,
of course, after enough training the mutilated variants will start
repeating and helping you classify the message as spam -- but
practically, in many cases you won't live to see this happen.

I'm not sure I can offer a concrete action plan, but, IMHO, it's worth
thinking about some more...

                                                     Stefan