Funny characters [was: Levenshtein distance ...]
David Relson
relson at osagesoftware.com
Fri Feb 25 14:36:50 CET 2005
On Fri, 25 Feb 2005 08:28:05 -0000
Lee Dowthwaite wrote:
> Apologies for the late input to this thread.
>
> Chris Fortune wrote:
> >in Ham:
> > James
> > james
> >
> >in Spam:
> > J@mes
> > j@/\/\e5
> > J4M3S
> > Jmaes
> > j_@_m.ez
> > ...
>
> A thought: I don't know what bogofilter currently does with non-alphabetic
> chars, but I would have thought that - except for "Jmaes" - marking these
> words as potential spam would be trivial simply because of the occurrence of
> "funny characters" mid-token?
>
> For a while I experimented with a Bayesian algo based on digraph and
> trigraph letter sequences: the idea was to maintain a fixed-size DB suitable
> for embedded applications. Ultimately it couldn't outperform a token-based
> system, and its accuracy was severely compromised by the "convolution
> factor" that some very hammy words shared di/trigraphs with very spammy
> words.
>
> However, one thing it was exceptionally good at was handling this kind of
> obfuscation: most of the common V-word forms were caught by it, and with a
> DB of only a few K. Looking into the matrix I could see that alphabetic
> characters adjacent to non-alphabetic characters were a rich hunting ground
> for spam: indeed, almost all such juxtapositions were the result of spam.
> Another thing it was very good at spotting - again, with minimal DB usage -
> was foreign content. On these grounds it may well have caught the "Jmaes"
> example also.
>
> Lee
Hi Lee,
Indeed, you have a point. We've long known that a deliberate misspelling
works once, and once only. As soon as bogofilter is told that v1agra is
bad, any use of that spelling is a red flag saying "Spam here!!!".
Including funny characters mid-token might be useful in the same way.
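
To make the "funny characters mid-token" idea concrete, here is a minimal
Python sketch. It is not bogofilter code: the helper name
has_mid_token_funny_char and the regular expression are assumptions made
up for illustration. It flags a token when one or more non-letter
characters sit between two ASCII letters.

```python
import re

# Hypothetical helper, not part of bogofilter.
# Matches: a letter, then one or more non-letter, non-whitespace
# characters, then another letter.
MID_TOKEN_FUNNY = re.compile(r"[A-Za-z][^A-Za-z\s]+[A-Za-z]")

def has_mid_token_funny_char(token):
    """Return True if a non-letter run is sandwiched between letters."""
    return MID_TOKEN_FUNNY.search(token) is not None

samples = ["James", "v1agra", "J4M3S", "j@/\\/\\e5", "Jmaes"]
flagged = [t for t in samples if has_mid_token_funny_char(t)]
print(flagged)
```

Note that a plain transposition such as "Jmaes" is not caught, and that
ordinary words like "don't" would be flagged too, so a check like this
could only ever be one hint among many.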
If you want to experiment, take a look at file lexer_v3.l. In lines
149-152 patterns TOKENFRONT, TOKENMID, and TOKENBACK are defined. You
can modify TOKENMID, rebuild and see what happens.
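
As a rough way to preview the effect before touching the lexer, the
sketch below mimics a front/mid/back token pattern in Python. The
character classes here are stand-ins chosen for illustration and are
NOT the definitions in lexer_v3.l; check the file itself for the real
ones. The names make_tokenizer, NARROW_MID and WIDE_MID are likewise
hypothetical.

```python
import re

def make_tokenizer(front, mid, back):
    # A token is one FRONT character, any number of MID characters,
    # and one BACK character (so at least two characters long).
    pattern = re.compile(f"{front}{mid}*{back}")
    return lambda text: pattern.findall(text)

# Assumed character classes, for illustration only.
FRONT = r"[A-Za-z0-9]"
BACK = r"[A-Za-z0-9]"
NARROW_MID = r"[A-Za-z0-9]"         # letters and digits only
WIDE_MID = r"[A-Za-z0-9@_/\\.]"     # also allow a few funny characters

narrow = make_tokenizer(FRONT, NARROW_MID, BACK)
wide = make_tokenizer(FRONT, WIDE_MID, BACK)

text = "meet j@/\\/\\e5 today"
print(narrow(text))  # the obfuscated word is broken into fragments
print(wide(text))    # the obfuscated word survives as a single token
```

With the wider mid-token class the obfuscated spelling reaches the
classifier as one distinctive token instead of being split up, which is
the behaviour one would be testing for after changing TOKENMID.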
Enjoy!
David