A suggestion for non-ASCII Scoring

Greg McCann greg at cambria.com
Mon Jan 26 19:07:10 CET 2004


On 1/26/2004 at 7:40 AM David Relson <relson at osagesoftware.com> wrote:

>On Mon, 26 Jan 2004 09:35:51 -0000
>Peter Bishop wrote:

>> Why use the "replace non-ASCII" option in the first place?

>replace-nonascii cuts down the number of weird tokens.  It's a space
>saver.  That's all.

As David mentions, scoring every spammy word in every language written with non-ASCII characters takes a lot of space in your database.  If your own language uses non-ASCII characters, you will want fine discrimination between the non-ASCII words in your email, so you will want each of those words scored separately.  But for those of us whose language uses ASCII characters almost exclusively, this is wasted space, since *any* non-ASCII content indicates a high probability of spam.

It also takes more training to accurately recognize all spammy non-ASCII words.  Before the recent patch that David kindly supplied for me, a lot of non-ASCII email would get through my filters because, even with the current non-ASCII substitution, it contained many words that bogofilter had never seen before, and those words were scored as neutral.

Currently, bogofilter's non-ASCII option replaces every non-ASCII character with "?", so, for example, instead of saving every unique non-ASCII five-letter word in your database, you get tokens like ??A??, ?b???, and ?K??f.  However, even this level of compression leaves you with a large number of low-count tokens that often do not match new spam.  I am proposing an option to take this compression one step further and tokenize any predominantly non-ASCII word as all "?" characters.  This further reduces the number of low-count tokens in the database and increases the likelihood of new non-ASCII spam being scored correctly.

Users who do receive legitimate non-ASCII email and need more discrimination between non-ASCII words will want to keep using the current non-ASCII option, or no non-ASCII substitution at all.  I suggest that this option be offered in addition to, rather than in place of, the current non-ASCII option.  This would let users choose the level of non-ASCII substitution that works best for them: none, individual characters, or whole words.
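
To make the three levels concrete, here is a rough sketch in Python.  It is not bogofilter's code (bogofilter is written in C), and the function names are made up for illustration.  It also rests on three assumptions of mine: "predominantly" means more than half of the characters, a collapsed word keeps its original length, and words that are only partly non-ASCII still get the per-character substitution.

```python
def replace_chars(token, placeholder="?"):
    """Current option as described above: each non-ASCII character becomes '?'."""
    return "".join(ch if ord(ch) < 128 else placeholder for ch in token)


def replace_words(token, threshold=0.5, placeholder="?"):
    """Proposed option: a predominantly non-ASCII word becomes all '?'.

    The threshold is an assumption; the proposal does not define
    "predominantly".  Keeping the word's length is also an assumption.
    """
    if not token:
        return token
    non_ascii = sum(1 for ch in token if ord(ch) >= 128)
    if non_ascii / len(token) > threshold:
        return placeholder * len(token)
    # Assumed fallback: mostly-ASCII words get per-character substitution.
    return replace_chars(token, placeholder)


def tokenize(text, level="none"):
    """Split on whitespace and apply one of the levels: none, chars, words."""
    transform = {
        "none": lambda t: t,
        "chars": replace_chars,
        "words": replace_words,
    }[level]
    return [transform(t) for t in text.split()]


# Three five-letter words, each with four non-ASCII letters and one ASCII letter.
words = "\u00f1\u00fcA\u00e9\u00e8 \u00e9b\u00f1\u00fc\u00e8 \u00fcK\u00e9\u00e8f"
print(len(set(tokenize(words, "none"))))   # three distinct raw tokens
print(len(set(tokenize(words, "chars"))))  # still distinct after '?' substitution
print(len(set(tokenize(words, "words"))))  # all collapse to one token
```

With the per-character level, the ASCII letters that survive keep the tokens distinct, which is where the low-count tokens come from.  With the whole-word level, every predominantly non-ASCII word of a given length maps to the same token, so its count builds up quickly.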


Greg McCann








