A suggestion for non-ASCII Scoring

Fri Jan 23 19:58:06 CET 2004

On 1/23/2004 at 1:13 PM David Relson <relson at osagesoftware.com> wrote:

>OK.  I'll be interested in hearing your impressions of effectiveness.  A
>more thorough test would involve:
>
>1 - creating two versions of bogofilter (with and without the change)
>2 - taking a large set of messages (both ham and spam)
>3 - using the two bogofilters and half the messages, create two
>wordlists
>4 - determine spam_cutoff for the with/without wordlists
>5 - score the second half of the messages and count false
>positives/negatives
>
>this would give a more accurate indication of how the change affects
>scoring.

That is true.  Unfortunately, I don't keep old messages.  Most of the spam that I use for training bogofilter (about 25,000 new messages per month) comes from spamtrap email addresses; it is piped automatically through "bogofilter -s" and then discarded.  Still, based on the patterns I have observed, I suspect that this change will let bogofilter zap most of the non-ASCII spam that has been sneaking into my inbox.  It will also reduce the size of wordlist.db significantly.
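
For anyone who does keep a corpus, step 5 of the procedure above comes down to a simple count.  Here is a rough sketch in Python, not bogofilter code: the function name and the (score, is_spam) input format are made up for illustration, and I am assuming a message is treated as spam when its score is at or above spam_cutoff (the "unsure" range is ignored).

```python
def count_errors(scored, spam_cutoff):
    """Count false positives and false negatives for one wordlist.

    scored      -- iterable of (score, is_spam) pairs for the held-out half
                   of the corpus (hypothetical format; is_spam is the known
                   true label)
    spam_cutoff -- cutoff determined in step 4 for this wordlist
    """
    false_pos = 0  # ham that was flagged as spam
    false_neg = 0  # spam that was not flagged
    for score, is_spam in scored:
        flagged = score >= spam_cutoff  # assumed classification rule
        if flagged and not is_spam:
            false_pos += 1
        elif not flagged and is_spam:
            false_neg += 1
    return false_pos, false_neg
```

Running this once per wordlist (with and without the change), each with its own spam_cutoff, gives the two pairs of counts to compare.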

The only situation where I can see this not helping is for users who receive legitimate email containing many non-ASCII characters.  They may want to continue scoring non-ASCII words as distinct tokens.

Not to make things too complicated, but users could have maximum flexibility in handling non-ASCII messages with a three-level scoring option:

replace_nonascii_characters=no            no non-ASCII substitution
replace_nonascii_characters=yes           substitute individual non-ASCII characters with ?
replace_nonascii_characters=whole_word    tokenize the whole non-ASCII word as ?...
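
To make the difference between the three settings concrete, here is a minimal sketch in Python.  It is only an illustration, not bogofilter's lexer: the tokenize() function, the whitespace-only word splitting, and the rule that a word counts as non-ASCII if it contains any character above 127 are all my assumptions, and the sketch uses a bare "?" as the whole-word placeholder since the exact token is left open above.  It also works on characters, where the real tokenizer would see bytes.

```python
import re

# Hypothetical names throughout; only the option name
# replace_nonascii_characters and its three values come from the proposal.
WORD_RE = re.compile(r"\S+")   # assumption: words are whitespace-separated
WHOLE_WORD_TOKEN = "?"         # assumption: placeholder for a non-ASCII word
VALID_MODES = ("no", "yes", "whole_word")


def tokenize(text, replace_nonascii_characters="no"):
    """Split text into tokens under one of the three proposed settings."""
    mode = replace_nonascii_characters
    if mode not in VALID_MODES:
        raise ValueError("unknown mode: %r" % (mode,))

    tokens = []
    for word in WORD_RE.findall(text):
        has_nonascii = any(ord(ch) > 127 for ch in word)
        if mode == "yes":
            # Replace each non-ASCII character individually.
            word = "".join(ch if ord(ch) < 128 else "?" for ch in word)
        elif mode == "whole_word" and has_nonascii:
            # Collapse the entire word into a single placeholder token.
            word = WHOLE_WORD_TOKEN
        tokens.append(word)
    return tokens
```

With "no", every distinct non-ASCII word stays its own token; with "yes", words that differ only in their non-ASCII characters collapse together; with "whole_word", all non-ASCII words share one token, which is where the wordlist shrinks the most.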

Greg McCann