Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam

Chris Fortune cfortune at telus.net
Sun Feb 20 01:06:58 CET 2005


Spam is more varied than ham.  As we all know, the "V" word can be (mis)spelled millions of ways, and foreign character sets
add many millions more possibilities.  As the once-popular game show "Bumper Stumpers" shows, the human mind can recognize even the
most horribly deofrmed splleing of a wrod, but computers that rely on exact one-to-one pattern matching cannot.

in Ham:
    James
    james

in Spam:
    J@mes
    j@/\/\e5
    J4M3S
    Jmaes
    j_@_m.ez
    ...

Enter the Levenshtein distance algorithm:
"Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and
the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For
example,
    If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
    If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s
    into t.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't
spell or pronounce Levenshtein, the metric is also sometimes called edit distance."
    -  http://www.merriampark.com/ld.htm
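
To make the definition concrete, here is a minimal Python sketch of the usual dynamic-programming way to compute LD. It is only an
illustration; the function name levenshtein is my own choice and nothing here is taken from bogofilter.

```python
def levenshtein(s, t):
    """Return the number of single-character deletions, insertions,
    or substitutions needed to transform s into t."""
    # prev[j] is the distance between the prefix of s handled so far
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]  # turning i characters of s into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1,          # delete cs
                           cur[j - 1] + 1,       # insert ct
                           prev[j - 1] + cost))  # substitute, or keep a match
        prev = cur
    return prev[-1]


print(levenshtein("test", "test"))    # 0
print(levenshtein("test", "tent"))    # 1
print(levenshtein("james", "jmaes"))  # 2
```

Note that a swapped pair of letters ("jmaes") counts as two edits under this definition, because a transposition is not one of the
three allowed operations.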

If bogofilter could recognize the similarity between subtle variations of the "V" word, it would not need to see all 10 million
variants; it could recognize them from a much smaller sample.
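
As a rough sketch of what that could look like, the Python below maps an incoming token to the nearest already-known word when the
edit distance is small enough. Everything in it is hypothetical: the helper closest_known, the max_distance threshold of 2, and the
lowercasing step are my assumptions for illustration, not anything bogofilter does today.

```python
def levenshtein(s, t):
    """Edit distance: deletions, insertions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def closest_known(token, known_words, max_distance=2):
    """Return (word, distance) for the nearest known word within
    max_distance edits of token, or None if no word is close enough.
    Hypothetical helper; the threshold is an assumed tuning knob."""
    token = token.lower()  # assumption: case differences are ignored
    best = None
    for word in known_words:
        d = levenshtein(token, word)
        if d <= max_distance and (best is None or d < best[1]):
            best = (word, d)
    return best


known = ["james"]
for variant in ["James", "J@mes", "J4M3S", "Jmaes", "j@/\\/\\e5"]:
    print(variant, "->", closest_known(variant, known))
```

With this threshold the first four variants map back to "james", but the heavily disguised j@/\/\e5 does not: it is 6 edits away, so
a plain distance cutoff would miss it. Raising the cutoff is not free either, since ordinary words such as "names" are only 1 edit
from "james".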



