Levenshtein distance as a useful pattern matching algorithm todecipher scrabble spam
David Relson
relson at osagesoftware.com
Sun Feb 20 13:32:56 CET 2005
On Sun, 20 Feb 2005 10:21:55 -0000
Laurence wrote:
> Chris Fortune wrote:
> > Spam is more various than ham. As we all know, the "V" word can be
> > (mis)spelled millions of ways, and foreign character sets can add
> > many millions more possibilities. As the once popular game show
> > "Bumper Stumpers" shows, the human mind can recognize even the most
> > horribly deofrmed splleing of a wrod, but computers that use direct
> > one-to-one pattern matching cannot.
> >
> > in Ham:
> > James
> > james
> >
> > in Spam:
> > J at mes
> > j@/\/\e5
> > J4M3S
> > Jmaes
> > j_ at _m.ez
> > ...
> >
> > Enter Levenshtein Distance algorithm:
>
> In your Ham/Spam examples our brains are firstly pattern matching and
> substituting letters:
>
> @ => a
> /\/\ => m
> 5 => s
> 4 => a
> M => m
> 3 => e
> _ => <null>
> . => <null>
>
> ...leaving aside the "mae" => "ame" in the 4th example for a moment. ;)
>
> Our brains also appear to be doing a bit of a dictionary lookup and using
> the context of the word to narrow down possibilities.
...[snip]...
> The letter substitution seems to be a possible new rule for the Bogofilter
> tokenisation process if it doesn't do it already.
>
> Maybe then the dictionary lookup uses Bogofilter's wordlist rather than a
> standard list of words. The Levenshtein algorithm could be used to weight
> the match of the message token against the wordlist token.
>
> I guess the context matching is a little beyond today's hardware. ;)
>
> Intuitively letter substitution/dictionary matching seems to be a good way
> to catch the punctuation and foreign character filled versions of the "V"
> word. Personally I'm seeing a lot of that sort of spam in my "Unsure"
> folder.
>
> What does Bogofilter currently do in the way of substitution matching and in
> removal of strings that don't count towards the phonetic value of a word as
> we read it when tokenising messages?
Bogofilter has minimal substituion capabilities. A few special
characters (with values above 0x80) are mapped to characters below 0x80.
For example in charset.c is function map_iso_8859_15. It maps several
money symbols, i.e. the pound, cent, euro, and yen characters, to the
dollar sign. There's also the replace_nonascii_characters option which
can be used to convert characters above 0x80 to question marks -
something I've found useful in processing asian spam.
Bogofilter doesn't do substitutions like "@" to "a" or "/\/\" to "m".
HTH,
David
More information about the Bogofilter
mailing list