Levenshtein distance as a useful pattern matching algorithm todecipher scrabble spam

Sun Feb 20 13:32:56 CET 2005

On Sun, 20 Feb 2005 10:21:55 -0000
Laurence wrote:

> Chris Fortune wrote:
> > Spam is more various than ham.  As we all know, the "V" word can be
> > (mis)spelled millions of ways, and foreign character sets can add
> > many millions more possibilities.  As the once popular game show
> > "Bumper Stumpers" shows, the human mind can recognize even the most
> > horribly deofrmed splleing of a wrod, but computers that use direct
> > one-to-one pattern matching cannot.
> >
> > in Ham:
> >    James
> >    james
> >
> > in Spam:
> >    J at mes
> >    j@/\/\e5
> >    J4M3S
> >    Jmaes
> >    j_ at _m.ez
> >    ...
> >
> > Enter Levenshtein Distance algorithm:
> 
> In your Ham/Spam examples our brains are firstly pattern matching and 
> substituting letters:
> 
> @     => a
> /\/\  => m
> 5     => s
> 4     => a
> M     => m
> 3     => e
> _     => <null>
> .     => <null>
> 
> ...leaving aside the "mae" => "ame" in the 4th example for a moment. ;)
> 
> Our brains also appear to be doing a bit of a dictionary lookup and using 
> the context of the word to narrow down possibilities.

...[snip]...

> The letter substitution seems to be a possible new rule for the Bogofilter 
> tokenisation process if it doesn't do it already.
> 
> Maybe then the dictionary lookup uses Bogofilter's wordlist rather than a 
> standard list of words.  The Levenshtein algorithm could be used to weight 
> the match of the message token against the wordlist token.
> 
> I guess the context matching is a little beyond today's hardware. ;)
> 
> Intuitively letter substitution/dictionary matching seems to be a good way 
> to catch the punctuation and foreign character filled versions of the "V" 
> word.  Personally I'm seeing a lot of that sort of spam in my "Unsure" 
> folder.
> 
> What does Bogofilter currently do in the way of substitution matching and in 
> removal of strings that don't count towards the phonetic value of a word as 
> we read it when tokenising messages?

Bogofilter has minimal substituion capabilities.  A few special
characters (with values above 0x80) are mapped to characters below 0x80.
For example in charset.c is function map_iso_8859_15.  It maps several
money symbols, i.e. the pound, cent, euro, and yen characters, to the
dollar sign. There's also the replace_nonascii_characters option which
can be used to convert characters above 0x80 to question marks -
something I've found useful in processing asian spam.

Bogofilter doesn't do substitutions like "@" to "a" or "/\/\" to "m".

HTH,

David