A suggestion for non-ASCII Scoring

David Relson relson at osagesoftware.com
Fri Jan 23 19:13:11 CET 2004


On Fri, 23 Jan 2004 09:48:08 -0800
Greg McCann wrote:

> On 1/23/2004 at 12:20 PM David Relson <relson at osagesoftware.com>
> wrote:
> 
> >On Fri, 23 Jan 2004 09:00:05 -0800
> >Greg McCann wrote:
> 
> ...
> >> I would like to propose an option to ignore any ASCII
> >> characters within a mostly non-ASCII word and tokenize it as if the
> >> word was entirely non-ASCII.
> ...
> 
> >If you want to experiment, I've written a patch that will convert the
> >symbols as you want.  The change compiles, but I've not run it, so it
> >may not work.  Test it and let us know if it actually helps.
> 
> Thank you, David.  This looks great.  It compiled fine with the 0.15.4
> source and a preliminary test shows that it does exactly what I was
> hoping for.  I will report back after a few days and let you know if
> it improves the effectiveness of non-ASCII filtering.
> 
> 
> Greg McCann

OK.  I'll be interested in hearing your impressions of effectiveness.  A
more thorough test would involve:

1 - creating two versions of bogofilter (with and without the change)
2 - taking a large set of messages (both ham and spam)
3 - using the two bogofilters and half the messages to create two
wordlists
4 - determining spam_cutoff for the with/without wordlists
5 - scoring the second half of the messages and counting false
positives/negatives

This would give a more accurate indication of how the change affects
scoring.
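
As a rough illustration of steps 2 through 5, here is a minimal Python
sketch of the bookkeeping.  It is not part of bogofilter.  The function
names (split_corpus, choose_cutoff, count_errors) are hypothetical, and
the scores are assumed to have been collected already by running each
bogofilter build over the messages, which is not shown.  The rule "spam
means score strictly above the cutoff" and the idea of picking the
cutoff from an allowed number of false positives are assumptions made
for the sketch; they may not match how you choose spam_cutoff.

```python
import random


def split_corpus(messages, seed=0):
    """Shuffle the messages and split them into a training half and a test half."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def choose_cutoff(ham_scores, allowed_fp=0):
    """Return a cutoff that at most `allowed_fp` ham scores strictly exceed.

    Assumption for this sketch: the cutoff is tuned on ham scores only.
    """
    if not ham_scores:
        raise ValueError("need at least one ham score")
    ranked = sorted(ham_scores, reverse=True)
    k = min(allowed_fp, len(ranked) - 1)
    return ranked[k]


def count_errors(scored, cutoff):
    """Count false positives and false negatives.

    `scored` is an iterable of (score, is_spam) pairs; a message is
    called spam when its score is strictly above `cutoff` (assumed rule).
    """
    scored = list(scored)
    fp = sum(1 for score, is_spam in scored if not is_spam and score > cutoff)
    fn = sum(1 for score, is_spam in scored if is_spam and score <= cutoff)
    return fp, fn
```

The idea would be to call choose_cutoff and count_errors once per build,
each with the scores produced by that build, and then compare the two
(false positive, false negative) pairs on the same test half.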



