Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam

Sun Feb 20 01:38:14 CET 2005

Hi Chris,

Some interesting reading!  Thanks.

It might be worth an experiment.  At one time, bogofilter had a related
capability for "token degeneration" -- given an unknown token bogofilter
would try to simplify it in order to match it with a known token.  For
example, if "FREE!!!" appeared in a Subject line, the following forms
would be tried (until a match was found):

   subj:FREE!!! subj:Free!!! subj:free!!!
   subj:FREE!   subj:Free!   subj:free!
   subj:FREE    subj:Free    subj:free
   FREE!!!      Free!!!      free!!!
   FREE!        Free!        free!
   FREE         Free         free

Tests ultimately showed that this capability wasn't of much value and it
got deleted in last year's code cleanup.

AFAICT, the Levenshtein Distance is much more complex. As I understand
it, it involves comparing an unknown token to _all_ entries in the
wordlist in order to find the best match (and then using that token's
info for scoring).  Off-hand that seems expensive - both in CPU time and
the need to process the full wordlist.  However, if you're so inclined,
please grab the code from the site and see what you can do. Once you've
got it running, Greg and I can advise you on running a meaningful test
to measure its effect on scoring accuracy.

Regards,

David

On Sat, 19 Feb 2005 16:06:58 -0800
Chris Fortune wrote:

> Spam is more various than ham.  As we all know, the "V" word can be (mis)spelled millions of ways, and foreign character sets can
> add many millions more possibilities.  As the once popular game show "Bumper Stumpers" shows, the human mind can recognize even the
> most horribly deofrmed splleing of a wrod, but computers that use direct one-to-one pattern matching cannot.
> 
> in Ham:
>     James
>     james
> 
> in Spam:
>     J at mes
>     j@/\/\e5
>     J4M3S
>     Jmaes
>     j_ at _m.ez
>     ...
> 
> Enter Levenshtein Distance algorithm:
> " Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and
> the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For
> example,
>     If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
>     If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into
> t.
> The greater the Levenshtein distance, the more different the strings are.
> Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't
> spell or pronounce Levenshtein, the metric is also sometimes called edit distance. "
>     -  http://www.merriampark.com/ld.htm
> 
> If bogofilter could recognize the similarity between subtle variations of the "V" word, it would not need to see all 10 million
> variants, but could recognize all variants using a much smaller sample.
> 
> _______________________________________________
> Bogofilter mailing list
> Bogofilter at bogofilter.org
> http://www.bogofilter.org/mailman/listinfo/bogofilter

-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800