SPAN style="DISPLAY: none" spams

Sat Jul 23 01:16:06 CEST 2005

On Fri, 22 Jul 2005 the voices made Tom Anderson write:

TA> From: "Tony L. Svanstrom" <tony at moon.pp.se>
TA> > One of the things that I'm considering is to either delete all HTML-parts
TA> > (and HTML-only e-mails), rejecting them or in one way or another refuse
TA> > this "HTML-mess" to even reach as far as where bogofilter must "guess" on
TA> > if the e-mail is spam or ham.
TA>
TA> Why?  HTML emails are the easiest to classify because spammers can't help
TA> but try to do tricks with them.  Detect just one trick, and you've got a
TA> 100% ID as spam.  The most common and unavoidable is this: <a
TA> href="spammer.com">paypal.com</a>.  Now I'm not suggesting that you have to
TA> run SpamAssassin or a hundred procmail recipes, but I'll definitely expend a
TA> little extra processing power to detect such flagrant and persistent scams.
TA> Meanwhile, I quite enjoy my HTML newsletters.
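
 For reference, the trick Tom describes (a link whose visible text names one
domain while its href points at another) could be detected roughly as in the
sketch below. It is only an illustration under my own assumptions: the names
find_link_mismatches, LinkMismatchParser and host_of are made up for this
example, the domain regex is deliberately crude, and neither bogofilter nor
SpamAssassin is claimed to work this way.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Crude pattern for "something that looks like a domain" in the visible link text.
DOMAIN_RE = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.I)


def host_of(url):
    """Return the lowercased host of an href, or '' if none can be found."""
    # Bare hrefs such as "spammer.com" carry no scheme; prefix "//" so that
    # urlparse treats the first component as the network location.
    if "//" not in url:
        url = "//" + url
    try:
        return (urlparse(url).hostname or "").lower()
    except ValueError:
        return ""


class LinkMismatchParser(HTMLParser):
    """Collects (shown_domain, real_host) pairs for suspicious <a> elements."""

    def __init__(self):
        super().__init__()
        self.href = None      # href of the <a> currently open, if any
        self.text = []        # visible text collected inside that <a>
        self.mismatches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self.href is None:
            return
        shown = DOMAIN_RE.search("".join(self.text))
        target = host_of(self.href)
        if shown and target:
            shown_host = shown.group(1).lower()
            # Accept an exact match or a subdomain of the shown domain.
            if target != shown_host and not target.endswith("." + shown_host):
                self.mismatches.append((shown_host, target))
        self.href = None


def find_link_mismatches(html):
    parser = LinkMismatchParser()
    parser.feed(html)
    parser.close()
    return parser.mismatches
```

 Fed the example from the quote, find_link_mismatches('<a
href="spammer.com">paypal.com</a>') reports the pair ("paypal.com",
"spammer.com"); a link to www.paypal.com labelled paypal.com reports nothing.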

 I'm somewhat old school when it comes to e-mail; but I don't even have to
refuse to read HTML e-mails, as all of the ones I'm getting include either a
full text/plain part, or at least a text/plain part with a link to the same
content on the web.
 There might be one or two companies sending me HTML-only e-mails; but that's
basically just friendly spam, i.e. ads I'm getting simply because I once bought
something from those companies... If it's quickly done I unsubscribe, otherwise
I just add them to my blacklist; I won't waste my time reading an unwanted
formatted e-mail trying to trick me into buying something I probably don't
need, just like I won't buy things from online stores which are IE-only.

 Using those tricks that you are talking about to detect spam is great if
you're either writing a general filter to be used by others (I wrote a rule or
two for SA, before giving up on such a general solution; this was before
Bayesian filtering in SA), or if you really want HTML in your e-mails; but by
removing the HTML parts from the spam problem you're free to focus on filtering
based on headers and what's actually 100% visible to the reader.
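
 As a rough illustration of taking the HTML parts out of the picture, the
sketch below keeps only the text/plain parts of a message and reports an
HTML-only message as having nothing to classify. It uses Python's standard
email package; the function name plain_text_only is made up for this example,
and what to do with an HTML-only message (reject, delete, blacklist) is left
to the caller.

```python
from email import message_from_string, policy


def plain_text_only(raw_message):
    """Return the concatenated text/plain parts, or None for HTML-only mail."""
    msg = message_from_string(raw_message, policy=policy.default)
    parts = []
    for part in msg.walk():
        # Skip multipart containers, HTML parts and attached text files.
        if part.get_content_type() == "text/plain" and not part.is_attachment():
            parts.append(part.get_content())
    return "\n".join(parts) if parts else None
```

 The returned text, together with the headers, is then the only thing a
filter such as bogofilter would need to see; anything hidden inside the HTML
part never reaches it.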

 Using the headers I can quickly, and with 100% certainty, find some of the ham
and some (or a lot) of the spam, which I of course will use to train a Bayesian
filter; and I can use Bayesian filtering without having to worry about whether
it will waste time on, or be tricked by, "invisible" parts of the e-mail.
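
 A header-only pre-sort of that kind might look something like the sketch
below. The rules are purely hypothetical: the two address sets and the
function name presort_by_headers are invented for this example, and real
rules would be whatever headers you personally can trust. Anything the rules
do not recognise comes back as "unsure" and is left for the Bayesian filter.

```python
from email import message_from_string, policy
from email.utils import parseaddr

# Hypothetical lists; in practice these would come from your own address
# book and blacklist.
KNOWN_HAM_SENDERS = {"friend@example.org"}
KNOWN_SPAM_SENDERS = {"ads@shop.example"}


def presort_by_headers(raw_message):
    """Classify a message as 'ham', 'spam' or 'unsure' from its From header."""
    msg = message_from_string(raw_message, policy=policy.default)
    # parseaddr returns (display name, address); only the address is used.
    sender = parseaddr(str(msg.get("From", "")))[1].lower()
    if sender in KNOWN_SPAM_SENDERS:
        return "spam"
    if sender in KNOWN_HAM_SENDERS:
        return "ham"
    return "unsure"
```

 Messages sorted into "ham" and "spam" this way can then be registered with
bogofilter using its -n and -s options respectively, so only the "unsure"
remainder depends on the filter's guess.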

 Ignoring HTML will make your filter (or filtering stage) far less sensitive
to the evolution of spam.

	/Tony
-- 
        /\___/\                                          /\___/\
        \_@ @_/                                          \_@ @_/
   .--oOO-(_)-OOo--------------------------------------oOO-(_)-OOo--.
   |  perl -e'print$_{$_} for sort%_=`lynx -dump svanstrom.com/t`'  |
   `---ôôô---ôôô----------------------------------------ôôô---ôôô---´
       \O/   \O/        ©1998-2005 svanstrom.com        \O/   \O/