What to do for HTML comment processing ???
Suzanne Skinner
tril at igs.net
Fri Mar 7 10:11:59 CET 2003
On Thu, Mar 06, 2003 at 10:31:27PM -0500, David Relson wrote:
> Just to make sure I understand, you want 2b, but not 2a? 2a is where
> href's and urls would live. 2b is the home for javascript, style sheets
> and, also, totally random stuff.
Well, firstly, consider these friendly suggestions rather than user requests
:-) (since I'm not using bogofilter at present). Here are my thoughts in more
detail:
- Grab all URLs, in and out of HTML tags
- Grab stuff inside simple one-line comments (but prefix it to give it
context), since they often contain good spam indicators. For instance,
less-than-smart spammers use the same nonsense word over and over in
multiple messages. Also, in my corpus a bunch of spam and zero ham
contain this line:
<!-- saved from url=(0022)http://internet.e-mail -->
(I have no idea what that's about.)
- Don't grab all HTML tokens (or at least make it an option), because this
tends to create a large pool of correlated tokens which show up in most HTML
messages (body, font, div, href, etc.). For those like me, who receive a small
but non-zero percentage of HTML ham, this means any unfortunate who mails us
in HTML gets a dangerously big spamminess boost.
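To make the three suggestions above concrete, here is a rough Python sketch
of such a tokenizer. It is only an illustration, not bogofilter's lexer: the
function name, the "url:" and "comment:" prefixes, and the regular
expressions are all assumptions chosen for the example.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+", re.IGNORECASE)
COMMENT_RE = re.compile(r"<!--(.*?)-->")  # '.' does not match newlines: one-line comments only
TAG_RE = re.compile(r"<[^>]*>")
WORD_RE = re.compile(r"\w+")


def tokenize(html):
    """Return tokens from an HTML message body, following the three suggestions."""
    tokens = []

    # 1. Grab all URLs, whether they sit inside tags, comments, or plain text.
    tokens += ["url:" + url for url in URL_RE.findall(html)]

    # 2. Grab words inside one-line comments, prefixed to give them context.
    for body in COMMENT_RE.findall(html):
        tokens += ["comment:" + word for word in WORD_RE.findall(body)]

    # 3. Drop comments and tags entirely, so tag and attribute names
    #    (body, font, div, href, ...) never become tokens.
    text = COMMENT_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    tokens += WORD_RE.findall(text)

    return tokens
```

Run on a message containing the "saved from url" comment, this yields
prefixed tokens such as "comment:saved" and "url:http://internet.e-mail",
while "font" and "href" never show up as tokens.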
In the homebrew implementation I'm playing with at present, I actually render
HTML to text using "links -dump", which lets me get at the "eyespace" of the
message, then comb the HTML afterwards for special clues. Works quite well,
though it isn't nearly as fast as bogofilter :-)
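A minimal sketch of that render-then-comb idea in Python, for illustration
only. It assumes the links browser is installed and on the PATH; the function
names, the token prefixes, and the choice of "special clues" (URLs and
one-line comments) are assumptions for the example, not a description of my
actual code.

```python
import os
import re
import subprocess
import tempfile


def render_with_links(html):
    """Render HTML to plain text by shelling out to `links -dump`.

    Assumes the links browser is installed and on PATH.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
        f.write(html)
        path = f.name
    try:
        result = subprocess.run(
            ["links", "-dump", path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    finally:
        os.unlink(path)


def eyespace_tokens(html, render=render_with_links):
    """Tokenize what the reader sees, then comb the raw HTML for extra clues."""
    # Words from the rendered text: the "eyespace" of the message.
    tokens = re.findall(r"\w+", render(html))

    # Clues the renderer hides: URLs and one-line comments, prefixed for context.
    tokens += ["url:" + u for u in re.findall(r"https?://[^\s\"'<>)]+", html)]
    for body in re.findall(r"<!--(.*?)-->", html):
        tokens += ["comment:" + w for w in re.findall(r"\w+", body)]
    return tokens
```

The renderer is passed in as a parameter so that something faster than an
external process can be swapped in, since spawning links once per message is
the slow part.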
Suzanne
--
tril at igs.net - http://www.igs.net/~tril/
A Pope has a Water Cannon. It is a Water Cannon.
He fires Holy-Water from it. It is a Holy-Water Cannon.
He Blesses it. It is a Holy Holy-Water Cannon.
He Blesses the Hell out of it. It is a Wholly Holy Holy-Water Cannon.
He has it pierced. It is a Holey Wholly Holy Holy-Water Cannon.
Batman and Robin arrive. He shoots them.
-- Principia Discordia