Funny characters [was: Levenshtein distance ...]

David Relson relson at osagesoftware.com
Fri Feb 25 14:36:50 CET 2005


On Fri, 25 Feb 2005 08:28:05 -0000
Lee Dowthwaite wrote:

> Apologies for the late input to this thread.
> 
> Chris Fortune wrote:
> >in Ham:
> >    James
> >    james
> >
> >in Spam:
> >    J@mes
> >    j@/\/\e5
> >    J4M3S
> >    Jmaes
> >    j_@_m.ez
> >    ...
> 
> A thought: I don't know what bogofilter currently does with non-alphabetic
> chars, but I would have thought that - except for "Jmaes" - marking these
> words as potential spam would be trivial simply because of the occurrence of
> "funny characters" mid-token?
> 
> For a while I experimented with a Bayesian algo based on digraph and
> trigraph letter sequences: the idea was to maintain a fixed-size DB suitable
> for embedded applications. Ultimately it couldn't outperform a token-based
> system, and its accuracy was severely compromised by the "convolution
> factor" that some very hammy words shared di/trigraphs with very spammy
> words.
> 
> However, one thing it was exceptionally good at was handling this kind of
> obfuscation: most of the common V-word forms were caught by it, and with a
> DB of only a few K. Looking into the matrix I could see that alphabetic
> characters adjacent to non-alphabetic characters were a rich hunting ground
> for spam: indeed, almost all such juxtapositions were the result of spam.
> Another thing it was very good at spotting - again, with minimal DB usage -
> was foreign content. On these grounds it may well have caught the "Jmaes"
> example also.
> 
> Lee

Hi Lee,

Indeed, you have a point.  We've long known that deliberate misspellings
work once, and once only.  Once bogofilter is told that v1agra is bad,
the use of that spelling is a red flag saying "Spam here!!!".  Letting
tokens include funny characters mid-token might be useful too.
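
To make the "funny characters mid-token" idea concrete, here is a rough
Python sketch. It is not bogofilter code; the helper name and the
ASCII-only definition of "letter" are assumptions made for illustration.
It flags any token in which a run of non-letters sits between two letters,
using the examples quoted above:

```python
import re

# Illustrative only: an ASCII letter, one or more non-letters, an ASCII letter.
# Digits count as "funny" here, so ordinary tokens like "mp3player" or "don't"
# would be flagged too; a real filter would leave that judgment to training.
_MID_FUNNY = re.compile(r"[A-Za-z][^A-Za-z]+[A-Za-z]")

def has_funny_mid_chars(token):
    """Return True if non-letter characters appear between two letters."""
    return _MID_FUNNY.search(token) is not None

samples = ["James", "james", "J@mes", r"j@/\/\e5", "J4M3S", "Jmaes", "j_@_m.ez"]
flagged = [t for t in samples if has_funny_mid_chars(t)]
print(flagged)
```

As Lee noted, "Jmaes" is not caught this way since it contains only
letters.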

If you want to experiment, take a look at file lexer_v3.l.  In lines
149-152 patterns TOKENFRONT, TOKENMID, and TOKENBACK are defined.  You
can modify TOKENMID, rebuild and see what happens.
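
If you'd like to get a feel for the effect before touching the lexer, a
front/mid/back token pattern can be mocked up in Python.  This is only a
sketch: the character classes below are made up for the example and are
NOT copied from lexer_v3.l, whose flex definitions differ.

```python
import re

def make_tokenizer(front, mid, back):
    """Build a tokenizer from three regex character-class bodies.

    A token is one 'front' char, any number of 'mid' chars, and one
    'back' char, so tokens are at least two characters long.
    """
    pattern = re.compile(f"[{front}][{mid}]*[{back}]")
    return lambda text: pattern.findall(text)

ALNUM = "A-Za-z0-9"

# Hypothetical classes for illustration; not bogofilter's real patterns.
strict = make_tokenizer(ALNUM, ALNUM, ALNUM)
loose = make_tokenizer(ALNUM, ALNUM + "@_.", ALNUM)  # widened "mid" class

text = "Dear j_@_m.ez and J4M3S"
print(strict(text))  # obfuscated name is broken into fragments
print(loose(text))   # obfuscated name survives as a single token
```

With the wider "mid" class the obfuscated spelling stays in one piece, so
it can be learned as a token in its own right.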

Enjoy!

David
_______________________________________________
Bogofilter mailing list
Bogofilter at bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter


