Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam

Tue Aug 23 08:03:50 CEST 2005

Opening this thread again, so top-posting to David's last message.

I agree that calculating Levenshtein distance for every unknown token could be quite expensive, but I would like to explore possible
optimizations that would make it worthwhile.  Can you tell me what made you conclude that "token degeneration" was not of much
value?

C

----- Original Message -----
From: "David Relson" <relson at osagesoftware.com>
Cc: <bogofilter at bogofilter.org>
Sent: Saturday, February 19, 2005 5:38 PM
Subject: Re: Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam

> Hi Chris,
>
> Some interesting reading!  Thanks.
>
> It might be worth an experiment.  At one time, bogofilter had a related
> capability for "token degeneration" -- given an unknown token bogofilter
> would try to simplify it in order to match it with a known token.  For
> example, if "FREE!!!" appeared in a Subject line, the following forms
> would be tried (until a match was found):
>
>    subj:FREE!!! subj:Free!!! subj:free!!!
>    subj:FREE!   subj:Free!   subj:free!
>    subj:FREE    subj:Free    subj:free
>    FREE!!!      Free!!!      free!!!
>    FREE!        Free!        free!
>    FREE         Free         free
>
> Tests ultimately showed that this capability wasn't of much value and it
> got deleted in last year's code cleanup.
>
> AFAICT, the Levenshtein Distance is much more complex. As I understand
> it, it involves comparing an unknown token to _all_ entries in the
> wordlist in order to find the best match (and then using that token's
> info for scoring).  Off-hand that seems expensive - both in CPU time and
> the need to process the full wordlist.  However, if you're so inclined,
> please grab the code from the site and see what you can do. Once you've
> got it running, Greg and I can advise you on running a meaningful test
> to measure its effect on scoring accuracy.
>
> Regards,
>
> David
>
> On Sat, 19 Feb 2005 16:06:58 -0800
> Chris Fortune wrote:
>
> > Spam is more various than ham.  As we all know, the "V" word can be (mis)spelled millions of ways, and foreign character sets
can
> > add many millions more possibilities.  As the once popular game show "Bumper Stumpers" shows, the human mind can recognize even
the
> > most horribly deofrmed splleing of a wrod, but computers that use direct one-to-one pattern matching cannot.
> >
> > in Ham:
> >     James
> >     james
> >
> > in Spam:
> >     J at mes
> >     j@/\/\e5
> >     J4M3S
> >     Jmaes
> >     j_ at _m.ez
> >     ...
> >
> > Enter Levenshtein Distance algorithm:
> > " Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s)
and
> > the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For
> > example,
> >     If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
> >     If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s
into
> > t.
> > The greater the Levenshtein distance, the more different the strings are.
> > Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't
> > spell or pronounce Levenshtein, the metric is also sometimes called edit distance. "
> >     -  http://www.merriampark.com/ld.htm
> >
> > If bogofilter could recognize the similarity between subtle variations of the "V" word, it would not need to see all 10 million
> > variants, but could recognize all variants using a much smaller sample.
> >
>
>
> --
> David Relson                   Osage Software Systems, Inc.
> relson at osagesoftware.com       Ann Arbor, MI 48103
> www.osagesoftware.com          tel:  734.821.8800