HTML entities (Was: Re: mass processing with mutt and Fcc)

Janne Nikula jni at iki.fi
Tue Apr 1 23:16:44 CEST 2003


* Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> The problem is that we need HTML processing to avoid the
> spammers' tricks with tags in the middle of words. [...]

Thinking about this led me to consider other possibilities for
intentional obfuscation in HTML.

I don't recall receiving junk mail like this so far, but one way to
effectively break bogofilter's ability to analyze HTML messages is to
randomly replace normal characters with numerical entities.


For example, in

    <p>Please buy our product!</p>

the user sees,

    Please buy our product!

and bogofilter sees the tokens,
    
    please buy our product


However, simply replacing 'u' with '&#117;' leads to,

    <p>Please b&#117;y o&#117;r prod&#117;ct!</p>

while the user sees,

    Please buy our product!

but bogofilter sees only the tokens,

    please prod
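
The effect can be reproduced with a rough Python sketch. The tokenizer
here is a made-up stand-in, not bogofilter's actual lexer: naive_tokens
and MIN_LEN are hypothetical names, and the three-letter minimum token
length is an assumption chosen only for illustration.

```python
import html
import re

MIN_LEN = 3  # assumed minimum token length, not a confirmed bogofilter setting

def naive_tokens(markup, decode_entities=False):
    """Hypothetical tokenizer: drop tags, split on non-letters, keep longer words."""
    text = re.sub(r"<[^>]*>", " ", markup)   # strip tags
    if decode_entities:
        text = html.unescape(text)           # turn &#117; back into 'u'
    words = re.split(r"[^A-Za-z]+", text)    # digits, '&', '#', ';' all split words
    return [w.lower() for w in words if len(w) >= MIN_LEN]

plain = "<p>Please buy our product!</p>"
obfuscated = "<p>Please b&#117;y o&#117;r prod&#117;ct!</p>"

print(naive_tokens(plain))                             # ['please', 'buy', 'our', 'product']
print(naive_tokens(obfuscated))                        # ['please', 'prod']
print(naive_tokens(obfuscated, decode_entities=True))  # ['please', 'buy', 'our', 'product']
```

Decoding the entities before splitting into words gives the same tokens
as the unobfuscated message, at least in this simplified model.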


If this technique is used even more extensively, for example by replacing
all vowels with numerical HTML entities, bogofilter can make very little
use of the HTML message body.
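
To get a feeling for how little is left, here is a small Python sketch
of the all-vowels case. obfuscate_vowels and letter_runs are
hypothetical helpers, and the three-letter cutoff is again only an
assumption, not bogofilter's documented behaviour.

```python
import html
import re

VOWELS = "aeiouAEIOU"

def obfuscate_vowels(text):
    """Replace every vowel with its decimal numeric character reference."""
    return "".join(f"&#{ord(c)};" if c in VOWELS else c for c in text)

def letter_runs(text, min_len=3):
    """Runs of ASCII letters of at least min_len (min_len is an assumed cutoff)."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text) if len(w) >= min_len]

message = "Please buy our product!"
hidden = obfuscate_vowels(message)

print(hidden)                            # Pl&#101;&#97;s&#101; b&#117;y &#111;&#117;r pr&#111;d&#117;ct!
print(letter_runs(hidden))               # [] : no run of three letters survives
print(html.unescape(hidden) == message)  # True : a browser still shows the original
```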



