A suggestion for non-ASCII Scoring

Mon Jan 26 20:19:58 CET 2004

"Greg McCann" <greg at cambria.com> wrote:

>But for those of us whose language uses ASCII characters almost exclusively, this is wasted space since *any* non-ASCII content indicates a high probability of spam.

Not true. There are typographic quotes (not "), dashes and
more. They do not appear often, but I do see them, so it is
not *any* non-ASCII. Anyhow, that is not important in this
case.

In this case it could be more promising to not allow
non-ASCII in the first place.

>It also takes more training to be able to accurately recognize all spammy non-ASCII words.  

I doubt that. For languages like German or French it makes
no difference: only a few words differ from another word in
just a non-ASCII character. It might seem true for Asian
languages, but those work surprisingly well with only very
few messages used in training (you can see that if you use
train-on-error only).

>Before the recent patch that David kindly supplied for me, a lot of non-ASCII email would get through my filters because (even using the current non-ASCII substitution) it would contain many words that bogofilter had never seen before and would be scored as neutral.

That would not be a reason for an error. Either you see too
many good words or too few bad ones. It could be that your
settings are not well chosen.

>Currently, bogofilter's non-ASCII option substitutes any non-ASCII character with "?", so, for example, instead of saving every unique non-ASCII five-letter word in your database, you get tokens like ??A??, ?b???, and ?K??f.  However, even this level of compression leaves you with a large number of low-count tokens which often do not match new spam.  I am proposing an option to take this compression one step further and tokenize any predominantly non-ASCII word as all "?" characters.

That means you only save word length. Funny idea.
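
To make the difference between the two substitution levels
concrete, here is a minimal Python sketch. It is not
bogofilter's code: the function names are made up for
illustration, and the 50% cut-off for "predominantly
non-ASCII" is an assumption, since the proposal does not
define one.

```python
def mask_chars(word):
    """Per-character substitution as described in the quoted text:
    every non-ASCII character is replaced with '?'."""
    return "".join(c if ord(c) < 128 else "?" for c in word)


def mask_word(word, threshold=0.5):
    """Proposed whole-word substitution (hypothetical helper).

    If more than `threshold` of the characters are non-ASCII, the
    whole word becomes '?' characters, so only its length survives.
    Otherwise fall back to per-character substitution.
    `threshold` is an assumed parameter, not part of the proposal.
    """
    if not word:
        return word
    non_ascii = sum(1 for c in word if ord(c) >= 128)
    if non_ascii / len(word) > threshold:
        return "?" * len(word)
    return mask_chars(word)


if __name__ == "__main__":
    # Sample words: mostly non-ASCII, partly non-ASCII, mostly ASCII.
    for w in ["\u4e2dA\u6587\u5b57\u5316", "Gr\u00f6\u00dfe", "caf\u00e9"]:
        print(repr(w), mask_chars(w), mask_word(w))
```

With these assumptions, a five-character word with a single
ASCII "A" in second position becomes ?A??? under the current
scheme and ????? under the proposed one, while a word like
"café" becomes caf? in both.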

>This further reduces the number of low-count tokens in the database and increases the likelihood of new non-ASCII spam being scored correctly.  Users who do receive legitimate non-ASCII email and require more discrimination between non-ASCII words will want to continue to use either the current non-ASCII option, or do no non-ASCII substitution at all.  I suggest that this option should be in addition to, rather than in place of, the current non-ASCII option.  This would allow users to determine the level of non-ASCII substitution that works best for them - none, individual characters, or whole words.

We just worked on reducing options, not introducing new ones.

Again, I really doubt it is needed. A lot can be achieved
with proper parameters and training.

pi