Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam

Fri Feb 25 09:28:05 CET 2005

Apologies for the late input to this thread.

Chris Fortune wrote:
>in Ham:
>    James
>    james
>
>in Spam:
>    J@mes
>    j@/\/\e5
>    J4M3S
>    Jmaes
>    j_@_m.ez
>    ...

A thought: I don't know what bogofilter currently does with non-alphabetic
chars, but I would have thought that - except for "Jmaes" - marking these
words as potential spam would be trivial simply because of the occurrence of
"funny characters" mid-token?

For a while I experimented with a Bayesian algo based on digraph and
trigraph letter sequences: the idea was to maintain a fixed-size DB suitable
for embedded applications. Ultimately it couldn't outperform a token-based
system, and its accuracy was severely compromised by the "convolution
factor": some very hammy words shared di/trigraphs with very spammy
words.
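
As a rough illustration of the digraph/trigraph idea (a minimal sketch only,
not the code I actually used; the function names and the sample tokens are
made up for this example), letter sequences can be extracted and counted like
this. The last lines show the sharing problem: a ham token and a spam token
can have a sequence in common.

```python
from collections import Counter


def char_ngrams(token, n):
    """Return the overlapping n-character sequences of a lowercased token."""
    token = token.lower()
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_counts(tokens, sizes=(2, 3)):
    """Count digraphs and trigraphs across a list of tokens."""
    counts = Counter()
    for token in tokens:
        for n in sizes:
            counts.update(char_ngrams(token, n))
    return counts


# Hypothetical training tokens, taken from the examples in this thread.
ham = ngram_counts(["James", "james"])
spam = ngram_counts(["J4M3S", "Jmaes"])

# Sequences seen in both classes blur the statistics.
shared = set(ham) & set(spam)
```

A real fixed-size DB would also need some way of bounding the table (for
example by restricting the alphabet), which this sketch does not attempt.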

However, one thing it was exceptionally good at was handling this kind of
obfuscation: most of the common V-word forms were caught by it, and with a
DB of only a few K. Looking into the matrix I could see that alphabetic
characters adjacent to non-alphabetic characters were a rich hunting ground
for spam: indeed, almost all such juxtapositions were the result of spam.
Another thing it was very good at spotting - again, with minimal DB usage -
was foreign content. On these grounds it may well have caught the "Jmaes"
example also.
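
To make the "alphabetic next to non-alphabetic" observation concrete, here is
a small sketch (the function name is hypothetical, and this is only one
possible way to detect such juxtapositions) that lists the adjacent character
pairs in a token where exactly one side is a letter:

```python
def mixed_junctions(token):
    """Return adjacent character pairs where exactly one side is a letter."""
    return [a + b for a, b in zip(token, token[1:])
            if a.isalpha() != b.isalpha()]


# Clean tokens produce no junctions; obfuscated ones produce several.
for word in ["James", "J4M3S", "j@/\\/\\e5", "Jmaes"]:
    print(word, mixed_junctions(word))
```

Note that "Jmaes" produces no junctions at all, which is why it needs a
different kind of evidence than the other variants.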

Lee

-----Original Message-----
From: bogofilter-bounces+lee=dowthwaite.net at bogofilter.org
[mailto:bogofilter-bounces+lee=dowthwaite.net at bogofilter.org]On Behalf
Of Chris Fortune
Sent: 20 February 2005 00:07
To: bogofilter at bogofilter.org
Subject: Levenshtein distance as a useful pattern matching algorithm
to decipher scrabble spam

Spam is more various than ham.  As we all know, the "V" word can be
(mis)spelled millions of ways, and foreign character sets can
add many millions more possibilities.  As the once popular game show "Bumper
Stumpers" shows, the human mind can recognize even the
most horribly deofrmed splleing of a wrod, but computers that use direct
one-to-one pattern matching cannot.

in Ham:
    James
    james

in Spam:
    J@mes
    j@/\/\e5
    J4M3S
    Jmaes
    j_@_m.ez
    ...

Enter Levenshtein Distance algorithm:
" Levenshtein distance (LD) is a measure of the similarity between two
strings, which we will refer to as the source string (s) and
the target string (t). The distance is the number of deletions, insertions,
or substitutions required to transform s into t. For
example,
    If s is "test" and t is "test", then LD(s,t) = 0, because no
transformations are needed. The strings are already identical.
    If s is "test" and t is "tent", then LD(s,t) = 1, because one
substitution (change "s" to "n") is sufficient to transform s into
t.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir
Levenshtein, who devised the algorithm in 1965. If you can't
spell or pronounce Levenshtein, the metric is also sometimes called edit
distance. "
    -  http://www.merriampark.com/ld.htm
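
For reference, the definition quoted above can be sketched with the usual
dynamic-programming approach. This is a minimal illustration, not anything
bogofilter contains; the function name is my own, and the comparison is
case-sensitive, so tokens are lowercased before comparing.

```python
def levenshtein(s, t):
    """Number of deletions, insertions, or substitutions to turn s into t."""
    # previous[j] holds the distance between the first i-1 chars of s
    # and the first j chars of t; only two rows are kept in memory.
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs from s
                current[j - 1] + 1,      # insert ct into s
                previous[j - 1] + cost,  # substitute cs with ct (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("test", "test"))
print(levenshtein("test", "tent"))
print(levenshtein("james", "J4M3S".lower()))
print(levenshtein("james", "jmaes"))
```

Note that plain Levenshtein distance counts a transposition such as "Jmaes"
as two substitutions; variants of edit distance that treat a swap as a single
operation exist, but they are outside the definition quoted here.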

If bogofilter could recognize the similarity between subtle variations of
the "V" word, it would not need to see all 10 million
variants, but could recognize all variants using a much smaller sample.

_______________________________________________
Bogofilter mailing list
Bogofilter at bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter
