html comment processing [was: database rebuild]

David Relson relson at osagesoftware.com
Sat Feb 8 18:55:11 CET 2003


At 03:41 AM 2/8/03, Nick Simicich wrote:

>By the way, as far as comment syntax goes:  Piotr KUCHARSKI ran the page 
>through Opera 6 and sent me the output: It gets all cases correct except 
>for 8, the nested comment case, strengthening the case against closing 
>comments on naked >.  I noted that there was another text based browser 
>included in Linux when Redhat released a fix for it - w3m, which is 
>supposed to "do the right thing" when displaying html and plain text. It 
>works properly with cases 1-7, and does what Opera does with case 8: It 
>closes on the first -->, not properly dealing with nested comments.

Nick,

As a geek, I find standards very useful: I like software that is 
compliant and dislike software that is buggy or non-compliant.  As a spam 
fighter, I have to deal with messages that don't comply with the 
standards.  This is called reality.

The question at issue here is how to handle html comments.  As you 
mentioned previously, the pages below have good info on the proper form 
for them:

         http://www.htmlhelp.com/reference/wilbur/misc/comment.html
         http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5


Bogofilter's code could be written to be standards compliant - check for 
the comment declaration strings, i.e. "<!" and ">", with optional 
whitespace, and for the actual comment brackets, i.e. the pairs of dashes 
("--") that begin and end the actual comment.  The code could also handle 
nested comment levels.
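
For illustration only, here is a minimal sketch in Python (not bogofilter's 
actual C code) of the strict check described above: "<!", then zero or more 
"--" ... "--" comments with optional whitespace after each, then ">".  The 
function name comment_declaration_end is hypothetical, and the sketch makes 
no attempt at nested comments.

```python
def comment_declaration_end(text, start=0):
    """Return the index just past a strict comment declaration that
    begins at `start`, or -1 if the text there is not well formed.

    Hypothetical helper for illustration; not part of bogofilter.
    """
    if not text.startswith("<!", start):
        return -1
    pos = start + 2
    while True:
        # ">" closes the declaration once we are outside any comment.
        if text.startswith(">", pos):
            return pos + 1
        # Otherwise the only thing allowed here is another "--" comment.
        if not text.startswith("--", pos):
            return -1
        close = text.find("--", pos + 2)
        if close == -1:
            return -1          # comment opened but never closed
        pos = close + 2
        # Whitespace is allowed after each comment.
        while pos < len(text) and text[pos].isspace():
            pos += 1
```

With this check a ">" inside the dashes does not end the comment, so 
"<!-- a > b -->" is consumed as a whole, while "<!-- a -- b -->" is 
rejected because "b" sits outside any pair of dashes.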

As your experiments have shown, the browsers don't handle nested comments 
(they close on the first "-->"), so bogofilter needn't handle them either.

Proper checking for the comment declaration strings ("<!" and ">") is more 
complex.  The initial implementation in bogofilter checked for "<!--" and 
"-->", but we discovered that spammers sometimes used a bare ">" to end the 
comment.  To deal with this, bogofilter was changed to accept that form as 
well.  At present I'm thinking that bogofilter should do proper checking 
(by default) and have an option for the other way, perhaps 
"relaxed_html_comments=true".
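
To make the two behaviours concrete, here is a rough Python sketch (again, 
not bogofilter's code).  It assumes the strict mode means "<!--" closed by 
"-->" and the relaxed mode means "<!--" closed by the first ">".  The 
function name strip_html_comments is hypothetical; the keyword argument 
just borrows the option name floated above.

```python
def strip_html_comments(text, relaxed_html_comments=False):
    """Remove html comments from `text`.

    Strict (default): a comment runs from "<!--" to the next "-->".
    Relaxed: a comment runs from "<!--" to the next ">".
    Hypothetical helper for illustration; not part of bogofilter.
    """
    terminator = ">" if relaxed_html_comments else "-->"
    out = []
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            out.append(text[pos:])      # no more comments
            break
        out.append(text[pos:start])     # keep text before the comment
        end = text.find(terminator, start + 4)
        if end == -1:
            # Assumption: an unterminated comment is left in place.
            out.append(text[start:])
            break
        pos = end + len(terminator)
    return "".join(out)
```

Under these assumptions, "vi<!-- x >agra" is left untouched in strict mode 
but becomes "viagra" in relaxed mode, which is the spammer trick the option 
is meant to catch.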

David





