Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam

Tue Aug 23 12:59:11 CEST 2005

On 20050822 (Mon) at 2303:50 -0700, Chris Fortune wrote:

> I agree that calculating Levenshtein distance for every unknown token
> could be quite expensive, but I would like to explore possible
> optimizations that would make it worthwhile.

It's been suggested that humans easily recognize words that have been
scrambled, provided that the fsrit and lsat ltetres are kpet in pcale.
For your purposes, it might be that nailing down the first and last
letters and the string length (maybe plus or minus one or two) would
suffice to produce a manageable but still useful set of possible
alternate tokens for which to calculate the ld.  For production use
we'd need a fast search algorithm to dig out those matches, but for
proof of concept speed doesn't much matter.

> Can you tell me what made you conclude that "token degeneration" was
> not of much value?

I see I never wrote it up for my web page on bogofilter, but David and
I ran a few experiments testing the accuracy of bogofilter with and
without degeneration enabled.  The conclusion was that there was no
statistically significant difference.  This, however, was in the days
before ESF, and it's conceivable that redesigning the experiments to
work with ESF enabled would produce a better test.

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |