invalid html warfare

Gustaf Erikson gustafe at home.se
Wed May 28 09:53:38 CEST 2003


John McCain <jmccain at layer3al.com> writes:

> imho, I think the days of using code-level html to identify spam are
> gone.  I think the only way statistical filters are going to
> continue to be effective is if they see the message exactly as a
> human would.

What would it cost, computationally, to run text/html messages through
a parser (lynx) before processing? It wouldn't have to be very smart
-- an HTML version of strings(1) should suffice. Messages containing
just a GIF would be automatic spam...
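
To make the "HTML version of strings(1)" idea concrete, here is a rough
sketch using Python's standard html.parser module rather than lynx. The
names TextExtractor, html_strings and looks_image_only are made up for
illustration and are not part of bogofilter; treating "has an image and
no text" as spam is only the heuristic suggested above, not a tested
rule.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text a reader would see; drop tags, comments, script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # visible text fragments, in document order
        self.images = 0        # number of <img> tags seen
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            self.images += 1
        # Any other tag, including bogus ones like <gmurfoophead>, is ignored.

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_strings(html):
    """Return (visible_text, image_count) for an HTML message body."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join fragments without a separator so a word split by a fake tag
    # is reassembled, then normalise whitespace.
    text = " ".join("".join(parser.chunks).split())
    return text, parser.images


def looks_image_only(html):
    """Hypothetical heuristic: at least one image and no visible text."""
    text, images = html_strings(html)
    return images > 0 and not text
```

The cost is a single pass over the message body, so it should be small
next to the rest of the delivery path, though I have not measured it.
The extracted text, not the raw HTML, would then be handed to the
tokenizer.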

> I've seen a great deal of statistical filter evasion such as the
> examples I cited.  The best case scenario right now seems to be that
> the filter will still catch the message, but our databases will
> gradually degrade with garbage data such as <gmurfoophead>, assuming
> we are maintaining the training database.

How about a utility that scans the database against the dict db and
a user-defined wordlist and presents the words that don't match, so
that they could be deleted from the spamlist? If you apply English
language rules for number of consonants etc., it could be pretty smart
at filtering nonsense.
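
A minimal sketch of such a scan, assuming the tokens have already been
dumped from the database into a plain list and that the dictionary and
the user wordlist are one-word-per-line text files. All function names
and the consonant-run threshold are assumptions for illustration; the
threshold in particular is a guess, not a linguistic fact.

```python
VOWELS = set("aeiouy")
MAX_CONSONANT_RUN = 5  # assumed threshold; "strengths" has a run of 5


def longest_consonant_run(word):
    """Length of the longest run of consecutive consonant letters."""
    longest = run = 0
    for ch in word.lower():
        if ch.isalpha() and ch not in VOWELS:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def looks_like_nonsense(word):
    """Crude English-shape test: no vowel at all, or too many consonants in a row."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return False  # numbers and punctuation: leave those to the human
    has_vowel = any(c in VOWELS for c in letters)
    return not has_vowel or longest_consonant_run(word) > MAX_CONSONANT_RUN


def load_words(path):
    """Read a one-word-per-line list (dictionary or user wordlist) into a set."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def deletion_candidates(tokens, known_words):
    """Tokens found in neither list, with the likeliest nonsense first."""
    unknown = [t for t in tokens if t.lower() not in known_words]
    return sorted(unknown, key=lambda t: (not looks_like_nonsense(t), t))
```

Note that a token like gmurfoophead passes the consonant test, so it is
the dictionary lookup that catches it; the consonant rule only decides
which candidates are shown to the user first.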

Perhaps. Or it would just mean more work for the recipient, no one
would use it, and the spammer would win. D'oh.

>
> I am beginning to wonder how practical it would be to ban html e-mail
> for my domain.

You must be the luckiest man alive -- no correspondents sending html
mail!

/g.

-- 
Gustaf Erikson <<< mobile: 073-338 76 18 >>> http://stureby.net/gustaf/
I didn't like the play, but I saw it under adverse conditions. The curtain 
was up.




More information about the Bogofilter mailing list