A suggestion for non-ASCII Scoring

Mon Jan 26 10:35:51 CET 2004

On 23 Jan 2004 at 10:58, Greg McCann wrote:

> The only situation where I could see this not being helpful is for users
> that receive legitimate email containing a lot of non-ASCII characters. 
> In that case, they may want to continue scoring non-ASCII words as
> distinct tokens.
> 

Why use the "replace non-ASCII" option in the first place?
I don't, so when I look in my database I see some pretty weird tokens
(Korean/Chinese), but the character sequences still form words.
Even if I don't understand them, bogofilter does,
so they are classified in the normal way.
Result: no Korean spam gets through now.
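
To illustrate why leaving non-ASCII bytes alone keeps these tokens useful, here is a minimal sketch in Python. It is not bogofilter's actual lexer: the function name `tokenize`, the `replace_non_ascii` flag, the "?" replacement character, and the UTF-8 sample text are all assumptions made for the example.

```python
def tokenize(raw, replace_non_ascii=False):
    """Split raw message bytes into whitespace-delimited tokens.

    Hypothetical stand-in for a spam filter's lexer, not bogofilter's code.
    """
    if replace_non_ascii:
        # Map every byte with the high bit set to "?" (assumed replacement).
        raw = bytes(b if b < 0x80 else ord("?") for b in raw)
    # bytes.split() with no argument splits on runs of ASCII whitespace.
    return raw.split()


# Two different Korean words, encoded as UTF-8 purely for illustration.
sample = "무료 대출".encode("utf-8")

kept = tokenize(sample)                               # bytes left untouched
replaced = tokenize(sample, replace_non_ascii=True)   # high bytes replaced

# Untouched, the two words stay distinct tokens; after replacement they
# collapse into the same run of "?" characters.
print(len(set(kept)), len(set(replaced)))
```

With replacement switched on, different words of the same byte length become indistinguishable, so the filter can no longer learn which particular words mark spam.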

PS: I use a spamtrap as well. But these days I classify the spam first
(with bogofilter), and only store spam in the database if it has a
borderline rating.
This saves space and does not appear to affect performance
(the false negative rate is still decreasing).
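
The selection step in the PS could be sketched roughly as follows. This is only an illustration under assumptions: the function name `select_for_training`, the 0.95 threshold, and the reading of "borderline" as "anything not already scored as confident spam" are hypothetical, and the scores are made up.

```python
def select_for_training(scored, confident_spam=0.95):
    """Return the spamtrap messages worth storing in the database.

    `scored` is a list of (message_id, score) pairs, where score is the
    filter's spam probability. Messages the filter already rates at or
    above `confident_spam` are skipped (assumed threshold).
    """
    return [msg for msg, score in scored if score < confident_spam]


# Made-up scores for three spamtrap messages.
trap_mail = [("msg-1", 0.999), ("msg-2", 0.60), ("msg-3", 0.10)]
to_store = select_for_training(trap_mail)
print(to_store)
```

Only the messages the filter was unsure about (or missed) are stored, which is where the space saving comes from.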

-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk