Levenshtein distance as a useful pattern matching algorithm to decipher scrabble spam

Lee Dowthwaite lee at dowthwaite.net
Sat Feb 26 15:11:15 CET 2005


Very true, although in my corpus it flagged far more spam having these
properties than ham. But I do recall some nasty problems with filenames, now
that you mention it - indeed, this was the kind of convolution that made me
give up trying with Xgraph sequences altogether. It was the 80:20 rule: the
Xgraphs categorised 80% of messages effortlessly and with a tiny DB, but the
remaining 20% caused all the problems.
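
As a rough illustration of the approach discussed above, here is a minimal sketch. It assumes that "Xgraph" means overlapping character n-grams (digraphs, trigraphs and so on), which this message does not spell out. The function names and the sample word list are hypothetical and are not taken from bogofilter or from the filter described here. The sketch builds a small table of n-grams from known-good words and reports what fraction of a token's n-grams never appear in that table.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Return the overlapping character n-grams of a lowercased word."""
    w = word.lower()
    return [w[i:i + n] for i in range(len(w) - n + 1)]


def build_table(words, n=2):
    """Count the n-grams seen in a (hypothetical) sample of known-good words."""
    table = Counter()
    for word in words:
        table.update(char_ngrams(word, n))
    return table


def unseen_fraction(word, table, n=2):
    """Fraction of the word's n-grams that never occur in the table."""
    grams = char_ngrams(word, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in table)
    return unseen / len(grams)


# Hypothetical three-word "ham" sample, purely for illustration.
table = build_table(["james", "message", "filter"])
print(unseen_fraction("james", table))
print(unseen_fraction("Jmaes", table))
```

With this tiny table, "james" scores 0.0 and the scrambled "Jmaes" scores 0.75, since three of its four digraphs are unseen. Tokens from filenames or program code would likely score high under the same measure, which is the kind of false positive raised in this thread.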

Lee


-----Original Message-----
From: bogofilter-bounces+lee=dowthwaite.net at bogofilter.org
[mailto:bogofilter-bounces+lee=dowthwaite.net at bogofilter.org] On Behalf
Of Edvard Majakari
Sent: 25 February 2005 08:31
To: bogofilter at bogofilter.org
Subject: Re: Levenshtein distance as a useful pattern matching algorithm
to decipher scrabble spam


"Lee Dowthwaite" <lee at dowthwaite.net> writes:

[...]
> for spam: indeed, almost all such juxtapositions were the result of spam.
> Another thing it was very good at spotting - again, with minimal DB usage -
> was foreign content. On these grounds it may well have caught the "Jmaes"
> example also.

What about code? Wouldn't procmail recipes, perl code, sendmail
configuration files etc. in e-mail seem like spam then?

--
# Edvard Majakari		Software Engineer
# PGP PUBLIC KEY available    	Soli Deo Gloria!

$_ = '456476617264204d616a616b6172692c20612043687269737469616e20'; print
join('',map{chr hex}(split/(\w{2})/)),uc
substr(crypt(60281449,'es'),2,4),"\n";
_______________________________________________
Bogofilter mailing list
Bogofilter at bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter
