A suggestion for non-ASCII Scoring
greg at cambria.com
Fri Jan 23 12:00:05 EST 2004
Despite having classified thousands of non-ASCII messages as spam using the "replace_nonascii_characters=yes" option, a couple of non-ASCII messages still get through my filter every day (bogofilter version 0.15.4).
The problem is that including the embedded ASCII characters of a mostly non-ASCII word in its token creates a large number of singleton tokens that are ineffective at filtering new non-ASCII messages.
For example, bogofilter classifies ???I?, ??F??, and b???? as distinct tokens. Then when I get a message containing ?J???, it is treated as a new, neutral token rather than a spammy one.
Whether it is ???I?, ??F??, b????, or ?J???, it is all spam to me. I would like to propose an option that ignores any ASCII characters within a mostly non-ASCII word and tokenizes it as if the word were entirely non-ASCII. In other words, ???I?, ??F??, b????, and ?J??? would all be tokenized as "?????" rather than as four distinct tokens. I believe this would greatly improve the effectiveness of my non-ASCII spam scoring.
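A minimal sketch of the proposed canonicalization, written in Python for illustration (bogofilter itself is C). It assumes the "replace_nonascii_characters=yes" option has already mapped each non-ASCII byte to '?', and it assumes a hypothetical threshold of "more than half non-ASCII" for what counts as a mostly non-ASCII word; the function name and threshold are my own, not bogofilter's.

```python
def canonicalize(token, threshold=0.5):
    """Collapse a mostly non-ASCII token to all '?'.

    Assumes non-ASCII characters have already been replaced by '?'
    (as with replace_nonascii_characters=yes). If more than
    `threshold` of the characters are '?', the embedded ASCII
    characters are ignored and the whole token becomes '?' repeated
    to the token's length; otherwise the token is left unchanged.
    """
    if not token:
        return token
    non_ascii = sum(1 for c in token if c == '?')
    if non_ascii / len(token) > threshold:
        return '?' * len(token)
    return token
```

With this, canonicalize("???I?"), canonicalize("??F??"), canonicalize("b????"), and canonicalize("?J???") all yield the single token "?????", so they share one database entry instead of four singletons, while an ordinary ASCII word passes through untouched.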
More information about the Bogofilter mailing list