A suggestion for non-ASCII Scoring

Fri Jan 23 18:00:05 CET 2004

In spite of having classified thousands of non-ASCII messages as spam with the "replace_nonascii_characters=yes" option enabled, a couple of non-ASCII messages still get through my filter every day (bogofilter version 0.15.4).

The problem is that bogofilter keeps any ASCII characters embedded within a non-ASCII word as part of the token. This creates a large number of singleton tokens that aren't effective in filtering new non-ASCII messages.

For example, bogofilter classifies ???I?, ??F??, and b???? as distinct tokens.  Then when I get a message containing ?J???, that word is treated as a new, neutral token rather than a spammy one.

I don't care if it is ???I?, ??F??, b????, or ?J??? - it is all spammy to me.  I would like to propose an option to ignore any ASCII characters within a mostly non-ASCII word and tokenize it as if the word were entirely non-ASCII.  In other words, ???I?, ??F??, b????, and ?J??? would all be tokenized as "?????" rather than as distinct tokens.  I believe this would greatly improve the effectiveness of my non-ASCII spam scoring.
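
To make the proposal concrete, here is a rough sketch in Python of the normalization I have in mind. It is not bogofilter code. The function name normalize_token, the "?" placeholder default, and the 0.5 cutoff for "mostly non-ASCII" are all my own assumptions for illustration; a real option would need to pick its own threshold and would live in the C lexer.

```python
def normalize_token(token: str, placeholder: str = "?", threshold: float = 0.5) -> str:
    """Collapse a mostly non-ASCII token into placeholders only.

    Assumes non-ASCII characters have already been replaced by `placeholder`
    (as with replace_nonascii_characters=yes). The name and the threshold
    are hypothetical, chosen only to illustrate the idea.
    """
    if not token:
        return token
    # Fraction of the token that was originally non-ASCII.
    replaced = token.count(placeholder)
    if replaced / len(token) > threshold:
        # Drop the embedded ASCII characters, keeping the token length.
        return placeholder * len(token)
    # Mostly-ASCII words are left untouched.
    return token


if __name__ == "__main__":
    for word in ["???I?", "??F??", "b????", "?J???", "hello"]:
        print(word, "->", normalize_token(word))
```

With this rule the four example words all map to the same token, so their spam counts would accumulate in one place, while ordinary ASCII words such as "hello" pass through unchanged.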

Best regards,

Greg McCann