A suggestion for non-ASCII Scoring
Peter Bishop
pgb at adelard.com
Mon Jan 26 10:35:51 CET 2004
On 23 Jan 2004 at 10:58, Greg McCann wrote:
> The only situation where I could see this not being helpful is for users
> that receive legitimate email containing a lot of non-ASCII characters.
> In that case, they may want to continue scoring non-ASCII words as
> distinct tokens.
>
Why use the "replace non-ASCII" option in the first place?
I don't, so when I look in my database I see some pretty weird tokens
(Korean/Chinese), but the character sequences still form words.
Even if I don't understand them, bogofilter does,
so they are classified in the normal way.
Result: no Korean spam gets through now.
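
To illustrate the point, here is a rough sketch of why replacing non-ASCII characters throws away information. It is not bogofilter's actual lexer; the `tokenize` function, the whitespace splitting, and the use of `?` as the replacement character are all assumptions made for the example.

```python
def tokenize(text, replace_non_ascii=False):
    """Split text on whitespace (a simplification of a real mail lexer).

    If replace_non_ascii is True, every character outside the ASCII range
    is mapped to '?' before splitting. Both the flag name and the '?'
    placeholder are hypothetical, chosen only for this illustration.
    """
    if replace_non_ascii:
        text = "".join(ch if ord(ch) < 128 else "?" for ch in text)
    return text.split()


korean = "무료 대출 상담"

kept = tokenize(korean)              # three distinct tokens survive
replaced = tokenize(korean, True)    # all three collapse to the same token

print(kept)
print(replaced)
print(len(set(kept)), len(set(replaced)))
```

With the characters left alone, each word stays a separate token that can pick up its own spam score; with replacement, the three words become indistinguishable.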
PS: I use a spamtrap as well, but these days I classify the spam first
(with bogofilter), and only store spam in the database if it has a
borderline rating.
This saves space and does not appear to affect performance
(the false negative rate is still decreasing).
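
The selection step could be sketched roughly as follows. This is only an illustration: the function name, the 0.95 threshold, and the sample scores are invented, and what counts as "borderline" would depend on the cutoffs in your own setup. The scores are assumed to come from running the classifier over each trapped message beforehand.

```python
def select_for_training(scored_spam, confident=0.95):
    """Keep only spamtrap messages the filter was not already sure about.

    scored_spam: iterable of (message_id, score) pairs, score in [0, 1].
    confident:   hypothetical threshold; scores at or above it are skipped
                 because the filter already recognises those messages.
    """
    return [msg_id for msg_id, score in scored_spam if score < confident]


# Made-up scores for four messages caught by the spamtrap.
trapped = [("a", 0.999), ("b", 0.62), ("c", 0.97), ("d", 0.40)]

to_store = select_for_training(trapped)
print(to_store)
```

Only the messages that the filter scored below the threshold are passed on for storage; the ones it already rates as near-certain spam add little and are skipped.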
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk