A suggestion for non-ASCII Scoring
Greg McCann
greg at cambria.com
Fri Jan 23 19:58:06 CET 2004
On 1/23/2004 at 1:13 PM David Relson <relson at osagesoftware.com> wrote:
>OK. I'll be interested in hearing your impressions of effectiveness. A
>more thorough test would involve:
>
>1 - creating two versions of bogofilter (with and without the change)
>2 - taking a large set of messages (both ham and spam)
>3 - using the two bogofilters and half the messages, create two
>wordlists
>4 - determine spam_cutoff for the with/without wordlists
>5 - score the second half of the messages and count false
>positives/negatives
>
>this would give a more accurate indication of how the change affects
>scoring.
That is true. Unfortunately I don't keep old messages. Most of the spam that I use for training bogofilter (about 25,000 new messages per month) arrives at spamtrap email addresses, is automatically fed through "bogofilter -s", and is then discarded. Still, based on the patterns I have observed, I suspect that this change will let bogofilter zap most of the non-ASCII spam that has been sneaking into my inbox. It should also reduce the size of wordlist.db significantly.
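For anyone who does keep a corpus, steps 2, 3 and 5 of the procedure quoted above could be sketched roughly as follows. This is only an illustration, not bogofilter code: `split_corpus`, `count_errors` and the `score` callable are hypothetical names, `score` stands in for whichever bogofilter build is under test, and training the wordlists and choosing spam_cutoff (steps 3 and 4) are left to bogofilter itself.

```python
import random

def split_corpus(messages, seed=0):
    """Shuffle labelled messages and split them into two halves.

    messages: list of (text, is_spam) pairs. The first half would be used
    to build the wordlist, the second half to measure errors.
    """
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def count_errors(test_set, score, spam_cutoff):
    """Count false positives and false negatives on the held-out half.

    score: hypothetical callable mapping message text to a spamicity in
    the range 0..1. Assumption: a message counts as spam when its score
    is at or above spam_cutoff.
    """
    false_pos = false_neg = 0
    for text, is_spam in test_set:
        predicted_spam = score(text) >= spam_cutoff
        if predicted_spam and not is_spam:
            false_pos += 1      # ham flagged as spam
        elif is_spam and not predicted_spam:
            false_neg += 1      # spam that slipped through
    return false_pos, false_neg
```

Running `count_errors` once per build (with and without the change), each with its own wordlist and cutoff, would give the two pairs of error counts to compare.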
The only situation where I could see this change not being helpful is for users who receive legitimate email containing a lot of non-ASCII characters. In that case, they may want to continue scoring non-ASCII words as distinct tokens.
Not to make things too complicated, but users could have maximum flexibility in handling non-ASCII messages with a three-level scoring option:
  replace_nonascii_characters=no          no non-ASCII substitution
  replace_nonascii_characters=yes         substitute individual non-ASCII characters with ?
  replace_nonascii_characters=whole_word  tokenize the whole non-ASCII word as ?...
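
To make the three levels concrete, here is a minimal sketch of how a single token might be rewritten under each setting. It is an illustration under stated assumptions, not bogofilter's tokenizer: the function name `replace_nonascii` is hypothetical, "non-ASCII" is taken to mean any character above code point 127, and the placeholder emitted for whole_word (a single "?") is a guess at what the collapsed token would look like.

```python
def replace_nonascii(token, mode="no", placeholder="?"):
    """Rewrite one token according to the proposed three-level option.

    mode mirrors the proposed replace_nonascii_characters values:
      "no"         - leave the token untouched
      "yes"        - replace each non-ASCII character with "?"
      "whole_word" - collapse any token containing non-ASCII characters
                     into one placeholder token (placeholder is an assumption)
    """
    if mode not in ("no", "yes", "whole_word"):
        raise ValueError("unknown mode: %r" % (mode,))

    has_nonascii = any(ord(ch) > 127 for ch in token)
    if mode == "no" or not has_nonascii:
        return token
    if mode == "yes":
        # Character-by-character substitution keeps the word's length and
        # its ASCII letters, so different words can still map to different tokens.
        return "".join(ch if ord(ch) <= 127 else "?" for ch in token)
    # whole_word: every non-ASCII word maps to the same single token.
    return placeholder
```

With "yes", a word such as "café" would become "caf?", while with "whole_word" it becomes the shared placeholder, so all non-ASCII words are scored as one token and the wordlist gains only a single entry for them.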
Greg McCann