What to do for HTML comment processing ???

Suzanne Skinner tril at igs.net
Fri Mar 7 10:11:59 CET 2003


On Thu, Mar 06, 2003 at 10:31:27PM -0500, David Relson wrote:

> Just to make sure I understand, you want 2b, but not 2a?  2a is where
> href's and urls would live.  2b is the home for javascript, style sheets
> and, also, totally random stuff.

Well, firstly, consider these friendly suggestions rather than user requests
:-) (since I'm not using bogofilter at present). Here are my thoughts in more
detail:

- Grab all URLs, in and out of HTML tags

- Grab stuff inside simple one-line comments (but prefix it to give it
  context), since they often contain good spam indicators. For instance,
  less-than-smart spammers use the same nonsense word over and over in
  multiple messages. As another example, in my corpus a bunch of spam and
  zero ham contain this line:

  <!-- saved from url=(0022)http://internet.e-mail -->

  (I have no idea what that's about.)

- Don't grab all HTML tokens (or at least make it an option), because this
  tends to create a large pool of correlated tokens which show up in most HTML
  messages (body, font, div, href, etc). For those like me, who receive a small
  but non-zero percentage of HTML ham, this means any unfortunate who mails us
  in HTML gets a dangerously big spamminess boost.
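
The three suggestions above could be sketched roughly as follows. This is only
an illustration in Python, not bogofilter's lexer: the function name, the
regular expressions, and the "url:" and "comment:" prefixes are all assumptions
made for the example.

```python
import re

# Hypothetical patterns for illustration; a real lexer would be more careful.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)
ONE_LINE_COMMENT_RE = re.compile(r"<!--(.*?)-->")  # no DOTALL: one line only
ANY_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(html):
    """Return tokens for one HTML body, following the three suggestions."""
    tokens = []

    # 1. Grab all URLs, whether they sit inside tags, comments, or text.
    tokens += ["url:" + url for url in URL_RE.findall(html)]

    # 2. Grab words from one-line comments, prefixed to give them context.
    for body in ONE_LINE_COMMENT_RE.findall(html):
        body = URL_RE.sub(" ", body)  # URLs were already collected above
        tokens += ["comment:" + w.lower() for w in WORD_RE.findall(body)]

    # 3. Drop comments and tags entirely, so tag and attribute names
    #    (body, font, div, href, ...) never become tokens.
    text = ANY_COMMENT_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    tokens += [w.lower() for w in WORD_RE.findall(text)]
    return tokens


sample = (
    '<html><body><font color="red">Buy now</font> '
    '<a href="http://example.com/x">click</a>\n'
    "<!-- saved from url=(0022)http://internet.e-mail -->\n"
    "</body></html>"
)
tokens = tokenize(sample)
```

With the sample above, the comment line contributes prefixed tokens such as
"comment:saved", while "font" and "href" do not appear at all. Making step 3
optional would correspond to the "at least make it an option" part.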

In the homebrewed implementation I'm playing with at present, I actually render
HTML to text using "links -dump", which lets me get at the "eyespace" of the
message, then comb the HTML afterwards for special clues. Works quite well,
though it isn't nearly as fast as bogofilter :-)
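
For illustration, the rendering step might look something like the Python
sketch below. It is not the implementation described above; the function names
are hypothetical, and it assumes a "links" binary on the PATH that accepts
"-dump" followed by a file name.

```python
import shutil
import subprocess


def dump_command(html_path, browser="links"):
    """Build the command line that renders an HTML file to plain text."""
    return [browser, "-dump", html_path]


def render_to_text(html_path, browser="links"):
    """Return the rendered text of html_path, or None if rendering fails.

    Assumes the browser prints the rendered page to standard output.
    """
    if shutil.which(browser) is None:
        return None  # browser not installed; the caller must fall back
    result = subprocess.run(
        dump_command(html_path, browser),
        capture_output=True,
        text=True,
        errors="replace",  # tolerate odd encodings in the rendered output
        check=False,
    )
    return result.stdout if result.returncode == 0 else None
```

The rendered text would then be tokenized as ordinary words, with a second
pass over the raw HTML for URLs and comments. Spawning a process per message
is the likely reason such an approach is slow.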

Suzanne

-- 
tril at igs.net - http://www.igs.net/~tril/

A Pope has a Water Cannon.                               It is a Water Cannon.
He fires Holy-Water from it.                        It is a Holy-Water Cannon.
He Blesses it.                                 It is a Holy Holy-Water Cannon.
He Blesses the Hell out of it.          It is a Wholly Holy Holy-Water Cannon.
He has it pierced.                It is a Holey Wholly Holy Holy-Water Cannon.
Batman and Robin arrive.                                       He shoots them.
                                    -- Principia Discordia



