Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam
Chris Fortune
cfortune at telus.net
Tue Aug 23 08:03:50 CEST 2005
Opening this thread again, so top-posting to David's last message.
I agree that calculating Levenshtein distance for every unknown token could be quite expensive, but I would like to explore possible
optimizations that would make it worthwhile. Can you tell me what made you conclude that "token degeneration" was not of much
value?
C
----- Original Message -----
From: "David Relson" <relson at osagesoftware.com>
Cc: <bogofilter at bogofilter.org>
Sent: Saturday, February 19, 2005 5:38 PM
Subject: Re: Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam
> Hi Chris,
>
> Some interesting reading! Thanks.
>
> It might be worth an experiment. At one time, bogofilter had a related
> capability for "token degeneration" -- given an unknown token bogofilter
> would try to simplify it in order to match it with a known token. For
> example, if "FREE!!!" appeared in a Subject line, the following forms
> would be tried (until a match was found):
>
> subj:FREE!!! subj:Free!!! subj:free!!!
> subj:FREE! subj:Free! subj:free!
> subj:FREE subj:Free subj:free
> FREE!!! Free!!! free!!!
> FREE! Free! free!
> FREE Free free
>
> Tests ultimately showed that this capability wasn't of much value and it
> got deleted in last year's code cleanup.
>
> AFAICT, the Levenshtein Distance is much more complex. As I understand
> it, it involves comparing an unknown token to _all_ entries in the
> wordlist in order to find the best match (and then using that token's
> info for scoring). Off-hand that seems expensive - both in CPU time and
> the need to process the full wordlist. However, if you're so inclined,
> please grab the code from the site and see what you can do. Once you've
> got it running, Greg and I can advise you on running a meaningful test
> to measure its effect on scoring accuracy.
>
> Regards,
>
> David
>
> On Sat, 19 Feb 2005 16:06:58 -0800
> Chris Fortune wrote:
>
> > Spam is more various than ham. As we all know, the "V" word can be (mis)spelled millions of ways, and foreign character sets
can
> > add many millions more possibilities. As the once popular game show "Bumper Stumpers" shows, the human mind can recognize even
the
> > most horribly deofrmed splleing of a wrod, but computers that use direct one-to-one pattern matching cannot.
> >
> > in Ham:
> > James
> > james
> >
> > in Spam:
> > J at mes
> > j@/\/\e5
> > J4M3S
> > Jmaes
> > j_ at _m.ez
> > ...
> >
> > Enter Levenshtein Distance algorithm:
> > " Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s)
and
> > the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For
> > example,
> > If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
> > If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s
into
> > t.
> > The greater the Levenshtein distance, the more different the strings are.
> > Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't
> > spell or pronounce Levenshtein, the metric is also sometimes called edit distance. "
> > - http://www.merriampark.com/ld.htm
> >
> > If bogofilter could recognize the similarity between subtle variations of the "V" word, it would not need to see all 10 million
> > variants, but could recognize all variants using a much smaller sample.
> >
>
>
> --
> David Relson Osage Software Systems, Inc.
> relson at osagesoftware.com Ann Arbor, MI 48103
> www.osagesoftware.com tel: 734.821.8800
More information about the Bogofilter
mailing list