Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam

Laurence ljng at hbbs.org
Sun Feb 20 11:21:55 CET 2005


Chris Fortune wrote:
> Spam is more various than ham.  As we all know, the "V" word can be
> (mis)spelled millions of ways, and foreign character sets can add
> many millions more possibilities.  As the once popular game show
> "Bumper Stumpers" shows, the human mind can recognize even the most
> horribly deofrmed splleing of a wrod, but computers that use direct
> one-to-one pattern matching cannot.
>
> in Ham:
>    James
>    james
>
> in Spam:
>    J at mes
>    j@/\/\e5
>    J4M3S
>    Jmaes
>    j_ at _m.ez
>    ...
>
> Enter Levenshtein Distance algorithm:

In your Ham/Spam examples our brains are first pattern-matching and 
substituting letters:

@     => a
/\/\  => m
5     => s
4     => a
M     => m
3     => e
_     => <null>
.     => <null>

...leaving aside the "mae" => "ame" in the 4th example for a moment. ;)
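
For what it's worth, that substitution step is easy enough to sketch.  The 
table and the normalise() name below are purely illustrative (and it's 
Python only for the sake of example); I'm not suggesting Bogofilter does 
anything like this today:

    # Illustrative only: undo common letter substitutions in a token
    # before any further matching.
    SUBS = [
        ("/\\/\\", "m"),   # multi-character sequences first
        ("@", "a"),
        ("4", "a"),
        ("3", "e"),
        ("5", "s"),
        ("_", ""),         # characters with no phonetic value
        (".", ""),
    ]

    def normalise(token):
        token = token.lower()
        for pattern, plain in SUBS:
            token = token.replace(pattern, plain)
        return token

    # normalise("j@/\\/\\e5")  -> "james"
    # normalise("J4M3S")       -> "james"
    # normalise("j_@_m.ez")    -> "jamez"  (still one letter off -- see below)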

Our brains also appear to be doing a bit of a dictionary lookup and using 
the context of the word to narrow down possibilities.

On my TiVo I have a program that will overlay a line of text on screen. 
It's called from a script that monitors my incoming email.  To squeeze 
more information onto the screen the script applies a number of rules to 
shorten the string.  It does a one-to-one replacement for some names and 
strings in subjects.  It also removes any repeated letters in a word and 
every vowel that isn't at the start or end of a word.  Still, it's almost 
always possible to work out the original words![1]
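
For the curious, the shortening boils down to something like this (a rough 
Python approximation from memory; I've reduced the one-to-one replacements 
to the single "and" -> "&" case, and the real script also truncates the 
result to fit the screen, which is where the trailing "~" in the footnote 
comes from):

    import re

    def shorten(text):
        """Rough approximation of the overlay-shortening rules."""
        text = text.replace(" and ", " & ")         # one-to-one replacements
        text = re.sub(r"(\w)\1+", r"\1", text)      # collapse repeated letters
        # drop vowels that are neither the first nor last letter of a word
        text = re.sub(r"\B[aeiou]\B", "", text, flags=re.IGNORECASE)
        return text

    # shorten("JFK TiVo<[oztivo] Cache Card and Memory.... PLEASE WAIT?")
    #   -> "JFK TVo<[oztvo] Cche Crd & Mmry.... PLSE WT?"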

The letter substitution seems like a possible new rule for the Bogofilter 
tokenisation process, if it doesn't do something similar already.

Maybe the dictionary lookup could then use Bogofilter's wordlist rather 
than a standard list of words, with the Levenshtein algorithm weighting how 
closely a message token matches a wordlist token.
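
The distance itself is just the standard dynamic-programming calculation; 
here's a minimal sketch, with the wordlist-weighting idea being pure 
speculation on my part:

    def levenshtein(a, b):
        """Classic edit distance: insertions, deletions, substitutions."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def best_match(token, wordlist):
        """Pick the wordlist token closest to the message token."""
        return min(wordlist, key=lambda w: levenshtein(token, w))

    # levenshtein("jamez", "james") -> 1
    # levenshtein("jmaes", "james") -> 2  (plain Levenshtein charges a
    #   transposition as two edits; Damerau-Levenshtein would charge one)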

I guess the context matching is a little beyond today's hardware. ;)

Intuitively, letter substitution plus dictionary matching seems a good way 
to catch the punctuation- and foreign-character-filled versions of the "V" 
word.  Personally I'm seeing a lot of that sort of spam in my "Unsure" 
folder.
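
Gluing the two sketches above together on Chris's examples (same made-up 
function names as before, and a toy wordlist standing in for Bogofilter's):

    wordlist = ["james", "cache", "memory"]
    for raw in ["j@/\\/\\e5", "J4M3S", "j_@_m.ez", "Jmaes"]:
        tok = normalise(raw)
        hit = best_match(tok, wordlist)
        print(raw, "->", tok, "->", hit, "distance", levenshtein(tok, hit))
    # every variant lands on "james" with a distance of 0 to 2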

When tokenising messages, what does Bogofilter currently do in the way of 
substitution matching, and in removing characters that don't contribute to 
the phonetic value of a word as we read it?

Laurence, yowzers, my brain works on Sunday mornings! ;P


[1]: Here's an example:

String displayed on TiVo: "JFK TVo<[oztvo] Cche Crd & Mmry....~"

...and the process:
Read From  : From: "JFK TiVo" <some.email.address>
Parsed From: JFK TiVo
Full Subj  : Re: [oztivo] Cache Card and Memory.... PLEASE WAIT?
Full string: JFK TiVo<[oztivo] Cache Card and Memory.... PLEASE WAIT?
1st Compressed : JFK TiVo<[oztivo] Cache Card & Memory.... PLEASE WAIT?
2nd Compressed : JFK TVo<[oztvo] Cche Crd & Mmry.... PL


