html comment processing [was: database rebuild]
David Relson
relson at osagesoftware.com
Sat Feb 8 18:55:11 CET 2003
At 03:41 AM 2/8/03, Nick Simicich wrote:
>By the way, as far as comment syntax goes: Piotr KUCHARSKI ran the page
>through Opera 6 and sent me the output: It gets all cases correct except
>for 8, the nested comment case, strengthening the case against closing
>comments on naked >. I noted that there was another text based browser
>included in Linux when Redhat released a fix for it - w3m, which is
>supposed to "do the right thing" when displaying html and plain text. It
>works properly with cases 1-7, and does what Opera does with case 8: It
>closes on the first -->, not properly dealing with nested comments.
Nick,
As a geek, I find standards very useful; I like software that is
compliant and dislike software that is buggy or non-compliant. As a spam
fighter, I have to deal with messages that don't comply with the
standards. This is called reality.
The question at issue here is how to handle html comments. As you
mentioned previously, the pages below have good info on the proper form
for them:
http://www.htmlhelp.com/reference/wilbur/misc/comment.html
http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5
Bogofilter's code could be written to be standards compliant - check for
the comment declaration strings, i.e. "<!" and ">", with optional
whitespace, and for the actual comment brackets, i.e. the pairs of dashes
("--") that begin and end the actual comment. The code could also handle
nested comment levels.
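
To make the standards-compliant form concrete, here is a minimal sketch in
Python (not bogofilter's actual lexer code). It treats a comment declaration
as "<!", then zero or more "--...--" comments, each optionally followed by
whitespace, then ">". The names COMMENT_DECL and strip_comment_declarations
are made up for illustration, and because the regex can backtrack it is
somewhat more lenient than a real SGML parser would be.

```python
import re

# One comment declaration: "<!", then zero or more "--...--" comments,
# each optionally followed by whitespace, then the closing ">".
# Hypothetical pattern for illustration; not taken from bogofilter.
COMMENT_DECL = re.compile(r"<!(?:--.*?--\s*)*>", re.DOTALL)


def strip_comment_declarations(text: str) -> str:
    """Remove SGML-style comment declarations from text."""
    return COMMENT_DECL.sub("", text)


if __name__ == "__main__":
    # A bare ">" inside the dashes does not end the comment.
    print(strip_comment_declarations("a<!-- x > y -->b"))
    # Two comments inside one declaration are removed together.
    print(strip_comment_declarations("a<!-- one -- -- two -->b"))
    # Other declarations, such as a doctype, are left alone.
    print(strip_comment_declarations("<!DOCTYPE html>"))
```

Under this reading a ">" only closes the declaration when it follows a
closing "--" (plus optional whitespace).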
As your experiments have shown, the browsers don't handle nested comments,
so bogofilter needn't either.
Proper checking for the comment declaration strings ("<!" and ">") is more
complex. The initial implementation in bogofilter checked for "<!--" and
"-->", but we discovered that spammers sometimes used ">" to end the
comment. To deal with this, bogofilter was changed to end a comment on a
bare ">" as well. At present I'm thinking that bogofilter should do proper
checking (by default) and have an option for the other way, perhaps
"relaxed_html_comments=true".
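
The difference between the two modes can be sketched as follows. This is a
simplified Python illustration, not bogofilter code: the function name
strip_comments is hypothetical, the relaxed_html_comments parameter borrows
the tentative option name above, strict mode is reduced to matching "<!--"
with "-->", and dropping the rest of the text after an unterminated comment
is an assumption.

```python
def strip_comments(text: str, relaxed_html_comments: bool = False) -> str:
    """Drop "<!--" comments; the closing token depends on the mode."""
    # Strict mode closes only on "-->"; relaxed mode closes on any ">".
    closer = ">" if relaxed_html_comments else "-->"
    out = []
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            # No more comments: keep the remaining text.
            out.append(text[pos:])
            break
        out.append(text[pos:start])
        end = text.find(closer, start + 4)
        if end == -1:
            # Unterminated comment: assume the rest of the text is comment.
            break
        pos = end + len(closer)
    return "".join(out)


if __name__ == "__main__":
    msg = "V<!-- junk >IAG<!-- more -->RA"
    print(strip_comments(msg))                              # strict
    print(strip_comments(msg, relaxed_html_comments=True))  # relaxed
```

With this sample, strict mode swallows everything up to the first "-->" and
leaves "VRA", while relaxed mode ends each comment at the first ">" and
leaves "VIAGRA", which is what a spammer relying on the bare ">" would want
the reader to see.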
David