A suggestion for non-ASCII Scoring

Peter Bishop pgb at adelard.com
Mon Jan 26 10:35:51 CET 2004

On 23 Jan 2004 at 10:58, Greg McCann wrote:

> The only situation where I could see this not being helpful is for users
> that receive legitimate email containing a lot of non-ASCII characters. 
> In that case, they may want to continue scoring non-ASCII words as
> distinct tokens.

Why use the "replace non-ASCII" option in the first place?
I don't, so if I look in my database I see some pretty weird tokens
(Korean/Chinese). But the character sequences still make words,
so even if I don't understand them, bogofilter can still tell them apart
and they are classified in the normal way.
Result: no Korean spam gets through now.
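
To illustrate why keeping non-ASCII characters preserves useful tokens, here is a
minimal sketch. It is not bogofilter's actual lexer: the whitespace tokenizer, the
`replace_non_ascii` flag, and the sample subject line are all assumptions made up
for this example.

```python
import re

def tokenize(text, replace_non_ascii=False):
    """Split text on whitespace (a simplification, not bogofilter's lexer).

    When replace_non_ascii is True, every non-ASCII character is
    replaced with '?' before splitting, as a stand-in for a
    "replace non-ASCII" style option.
    """
    if replace_non_ascii:
        text = re.sub(r"[^\x00-\x7f]", "?", text)
    return text.split()

# Hypothetical subject line made of three different Korean words.
subject = "무료 대출 상담"

kept = tokenize(subject)
replaced = tokenize(subject, replace_non_ascii=True)

print(len(set(kept)))      # number of distinct tokens when characters are kept
print(len(set(replaced)))  # number of distinct tokens after replacement
```

With the characters kept, each word is its own token and can earn its own score.
After replacement, words of the same length collapse into one token, so most of
the distinguishing information is lost.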

PS I use a spamtrap as well. But these days I classify the spam first
(with bogofilter), and only store spam in the database if it has a
borderline rating.
This saves space and does not appear to hurt filtering accuracy
(the false negative rate is still decreasing).
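
The classify-first, store-only-if-borderline approach can be sketched roughly as
follows. The `should_store` helper, the 0.995 threshold, and the sample scores are
hypothetical. The sketch also assumes that "borderline" means any score below the
point where the filter is already confident, which is only one reading of the
description above.

```python
def should_store(spamicity, confident=0.995):
    """Decide whether a trapped spam is worth adding to the database.

    spamicity is the score the filter gave the message (0.0 to 1.0).
    Messages the filter already rates as near-certain spam are
    skipped. The 0.995 threshold is an illustrative assumption.
    """
    return spamicity < confident

# Hypothetical spamtrap messages with the scores they received.
trapped = [("msg1", 0.9999), ("msg2", 0.62), ("msg3", 0.991)]

to_store = [name for name, score in trapped if should_store(score)]
print(to_store)
```

Only the messages the filter was less sure about end up in the database, which
is where the space saving comes from.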

Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk
