A suggestion for non-ASCII Scoring

David Relson relson at osagesoftware.com
Fri Jan 23 18:20:24 CET 2004


On Fri, 23 Jan 2004 09:00:05 -0800
Greg McCann wrote:

> In spite of having classified thousands of non-ASCII messages as spam,
> using the "replace_nonascii_characters=yes" option, a couple of
> non-ASCII messages still get through my filter every day.  (bogofilter
> version 0.15.4)
> 
> The problem is that keeping the ASCII characters embedded within a
> non-ASCII word as part of the token creates a large number of
> singletons that aren't effective in filtering new non-ASCII messages.
> 
> For example, bogofilter classifies ???I?, ??F??, and b???? as distinct
> tokens.  Then when I get a message containing ?J???, it is considered
> a new neutral token rather than a spammy token.
> 
> I don't care if it is ???I?, ??F??, b????, or ?J??? - it is all spammy
> to me.  I would like to propose an option to ignore any ASCII
> characters within a mostly non-ASCII word and tokenize it as if the
> word was entirely non-ASCII.  In other words, ???I?, ??F??, b????, and
> ?J??? would all be tokenized as "?????" rather than as distinct
> tokens.  I believe this would greatly improve the effectiveness of my
> non-ASCII spam scoring.
> 
> Best regards,
> 
> 
> Greg McCann

Greg,

It could be done, though I question the value of doing it.

If you want to experiment, I've written a patch that will convert the
symbols as you want.  The change compiles, but I've not run it, so it
may not work.  Test it and let us know if it actually helps.
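
For readers without the attachment, the conversion being discussed can
be sketched roughly as follows.  This is not the attached patch and not
bogofilter's tokenizer code.  The function name
collapse_mostly_nonascii and the "more than half" threshold are
assumptions made for illustration, and the sketch presumes that
non-ASCII bytes have already been replaced with '?' (as with
replace_nonascii_characters=yes).

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (name and threshold are assumptions, not
 * bogofilter code).  If more than half of the characters in the
 * token are '?' placeholders, overwrite the whole token with '?'
 * so that all such words of the same length share one token.
 * Tokens that are mostly ASCII are left unchanged.
 */
void collapse_mostly_nonascii(char *token)
{
    size_t len = strlen(token);
    size_t marks = 0;
    size_t i;

    /* Count the placeholders left by the non-ASCII replacement. */
    for (i = 0; i < len; i++) {
        if (token[i] == '?')
            marks++;
    }

    /* "Mostly non-ASCII" is taken here to mean a strict majority. */
    if (marks * 2 > len)
        memset(token, '?', len);
}
```

With this rule, the four five-character examples in Greg's message
would all become the same token, while an ordinary ASCII word would be
left alone.  Whether that actually improves scoring is the open
question.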

David
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.token.c.0123
Type: application/octet-stream
Size: 872 bytes
Desc: not available
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20040123/71dd16cb/attachment.obj>

