A suggestion for non-ASCII Scoring
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Mon Jan 26 20:19:58 CET 2004
"Greg McCann" <greg at cambria.com> wrote:
>But for those of us whose language uses ASCII characters almost exclusively, this is wasted space since *any* non-ASCII content indicates a high probability of spam.
Not true. There are quotes (not "), dashes and more. They
may not appear very often, but I do see them. So it is not
*any* non-ASCII. But anyhow, that would not be important in
this case.
In this case it could be more promising to not allow
non-ASCII in the first place.
>It also takes more training to be able to accurately recognize all spammy non-ASCII words.
I doubt that. For languages like German or French it does
not make a difference: there are only a few words which
differ only in a non-ASCII character. It might look true for
Asian languages, but that works surprisingly well with only
very few messages used in training (if you only use
train-on-error you can see that).
>Before the recent patch that David kindly supplied for me, a lot of non-ASCII email would get through my filters because (even using the current non-ASCII substitution) it would contain many words that bogofilter had never seen before and would be scored as neutral.
That would not be a reason for an error. Either you see too
many good or too few bad words. It could be that your
settings are not really well chosen.
>Currently, bogofilter's non-ASCII option substitutes any non-ASCII character with "?", so, for example, instead of saving every unique non-ASCII five-letter word in your database, you get tokens like ??A??, ?b???, and ?K??f. However, even this level of compression leaves you with a large number of low-count tokens which often do not match new spam. I am proposing an option to take this compression one step further and tokenize any predominantly non-ASCII word as all "?" characters.
That means you only save word length. Funny idea.
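To make the two levels of substitution concrete, here is a minimal
sketch in Python. It is not bogofilter's code: the function names, the
0.5 threshold for "predominantly" and the use of Unicode code points
rather than raw bytes are all assumptions made for illustration.

```python
# Illustrative sketch only; not taken from bogofilter's source.
# Assumption: "non-ASCII" means code point >= 128, and a word is
# "predominantly" non-ASCII when more than half its characters are.

def replace_chars(word: str) -> str:
    """Per-character substitution: each non-ASCII character becomes '?'."""
    return "".join(c if ord(c) < 128 else "?" for c in word)


def collapse_word(word: str, threshold: float = 0.5) -> str:
    """Proposed whole-word substitution (hypothetical helper).

    A predominantly non-ASCII word is replaced by '?' repeated to the
    same length; any other word gets per-character substitution.
    """
    if not word:
        return word
    non_ascii = sum(1 for c in word if ord(c) >= 128)
    if non_ascii / len(word) > threshold:
        return "?" * len(word)
    return replace_chars(word)


if __name__ == "__main__":
    for w in ["\u043f\u0440A\u0438\u0432", "Gr\u00fc\u00dfe", "hello"]:
        print(replace_chars(w), collapse_word(w))
```

With these assumptions a five-letter word with four non-ASCII letters
collapses to a single token that records only its length, while a
German word such as "Grüße" keeps its ASCII letters under both schemes.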
>This further reduces the number of low-count tokens in the database and increases the likelihood of new non-ASCII spam being scored correctly. Users who do receive legitimate non-ASCII email and require more discrimination between non-ASCII words will want to continue to use either the current non-ASCII option, or do no non-ASCII substitution at all. I suggest that this option should be in addition to, rather than in place of, the current non-ASCII option. This would allow users to determine the level of non-ASCII substitution that works best for them - none, individual characters, or whole words.
We just worked on reducing options, not introducing new
ones. Again, I really doubt it is needed. A lot can be
achieved with proper parameters and training.
pi