Levenshtein distance as a useful pattern matching algorithm
todecipher scrabble spam
ljng at hbbs.org
Sun Feb 20 05:21:55 EST 2005
Chris Fortune wrote:
> Spam is more various than ham. As we all know, the "V" word can be
> (mis)spelled millions of ways, and foreign character sets can add
> many millions more possibilities. As the once popular game show
> "Bumper Stumpers" shows, the human mind can recognize even the most
> horribly deofrmed splleing of a wrod, but computers that use direct
> one-to-one pattern matching cannot.
> in Ham:
> in Spam:
> J at mes
> j_ at _m.ez
> Enter Levenshtein Distance algorithm:
In your Ham/Spam examples our brains are firstly pattern matching and
@ => a
/\/\ => m
5 => s
4 => a
M => m
3 => e
_ => <null>
. => <null>
...leaving aside the "mae" => "ame" in the 4th example for a moment. ;)
Our brains also appear to be doing a bit of a dictionary lookup and using
the context of the word to narrow down possibilities.
On my TiVo I have a program that will overlay a line of text on screen.
I've got it being called from a script that monitors my incoming email. To
squeeze more information on screen the script applies a number of rules to
shorten the string. It does a one to one replacement for some names and
strings in subjects. It also removes any repeated alphas in a word and
every vowel that isn't at the start or end of a word. Still it's almost
always possible to work out the original words!
The letter substitution seems to be a possible new rule for the Bogofilter
tokenisation process if it doesn't do it already.
Maybe then the dictionary lookup uses Bogofilter's wordlist rather than a
standard list of words. The Levenshtein algorithm could be used to weight
the match of the message token against the wordlist token.
I guess the context matching is a little beyond today's hardware. ;)
Intuitively letter substitution/dictionary matching seems to be a good way
to catch the punctuation and foreign character filled versions of the "V"
word. Personally I'm seeing a lot of that sort of spam in my "Unsure"
What does Bogofilter currently do in the way of substitution matching and in
removal of strings that don't count towards the phonetic value of a word as
we read it when tokenising messages?
Laurence, yowzers, my brain works on Sunday mornings! ;P
: Here's an example:
String displayed on TiVo: "JFK TVo<[oztvo] Cche Crd & Mmry....~"
...and the process:
Read From : From: "JFK TiVo" <some.email.address>
Parsed From: JFK TiVo
Full Subj : Re: [oztivo] Cache Card and Memory.... PLEASE WAIT?
Full string: JFK TiVo<[oztivo] Cache Card and Memory.... PLEASE WAIT?
1st Compressed : JFK TiVo<[oztivo] Cache Card & Memory.... PLEASE WAIT?
2nd Compressed : JFK TVo<[oztvo] Cche Crd & Mmry.... PL
More information about the Bogofilter