html comment processing

Herman Oosthuysen Herman at WirelessNetworksInc.com
Tue Apr 1 01:06:35 CEST 2003


>> > > <!-->third<-->
>> >
>> > Again "<!-->" is a comment declaration with data characters inside.
>> > "third" is part of the text. It needs to be counted.
>>
>> *sigh*
>>
>> I knew this was going to happen.
>> ">third<" is the comment. That is one valid comment declaration tag.
> 
> 
> This may be bad html.  Better form would be to escape the inner angle 
> brackets, i.e.
> 
> <!--&gt;third&lt;-->

Yep, if you want ">third<" to be a comment, then you have to escape
the angle brackets, giving "<!--&gt;third&lt;-->".

If "<!-->third<-->" is to be parsed literally, then
"<!-->" is the comment "--",
"third" is text and
"<-->" is an illegal tag.

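To make the two readings concrete, here is a rough Python sketch. It is
not Bogofilter's lexer, and the names strict_sgml_comments and
literal_tokens are made up for illustration. The first function follows
the strict SGML reading from the quoted text, where "--" opens and
closes a comment inside a "<!" declaration. The second follows the
literal reading above, where everything from "<" to the next ">" is one
tag-like token. Both are simplified: the strict version only looks at
the first comment in each declaration.

```python
import re


def strict_sgml_comments(text):
    """Comment bodies under the strict SGML reading.

    '<!' opens a declaration, '--' opens a comment, the next '--'
    closes it, and the following '>' closes the declaration.
    Simplification: only the first comment per declaration is read.
    """
    comments = []
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            break
        end = text.find("--", start + 4)
        if end == -1:
            break  # unterminated comment, give up
        comments.append(text[start + 4:end])
        close = text.find(">", end + 2)
        pos = close + 1 if close != -1 else len(text)
    return comments


def literal_tokens(text):
    """Tokens under the literal reading: '<' up to the next '>' is one unit."""
    tokens = []
    for piece in re.findall(r"<[^>]*>|[^<]+", text):
        if piece.startswith("<!"):
            kind = "comment"
        elif re.match(r"</?[A-Za-z]", piece):
            kind = "tag"
        elif piece.startswith("<"):
            kind = "illegal tag"
        else:
            kind = "text"
        tokens.append((kind, piece))
    return tokens


sample = "<!-->third<-->"
print(strict_sgml_comments(sample))  # ['>third<']
print(literal_tokens(sample))
# [('comment', '<!-->'), ('text', 'third'), ('illegal tag', '<-->')]
```

Under the strict reading "third" is hidden inside the comment; under the
literal reading it is visible text and would be counted.
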
Following some offline discussions, note this document type (DOCTYPE)
declaration at the start of every HTML doc:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">

Bogofilter should discard the above DOCTYPE declaration as if it were
a comment.
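
As a rough illustration of that discard step (a Python sketch only, not
Bogofilter's code; DOCTYPE_RE and strip_doctype are hypothetical names),
the declaration can be dropped before tokenizing. The sketch assumes the
declaration contains no ">" before its closing one, which holds for the
example above.

```python
import re

# "<!DOCTYPE", then anything up to the first ">", then that ">".
# Assumption: no internal subset ("[ ... ]") containing ">".
DOCTYPE_RE = re.compile(r"<!DOCTYPE\b[^>]*>", re.IGNORECASE)


def strip_doctype(text):
    """Remove DOCTYPE declarations so they contribute no tokens."""
    return DOCTYPE_RE.sub("", text)


doc = ('<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">\n'
       "<html><body>hello</body></html>")
print(strip_doctype(doc))
```
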

The three characters "<", ">" and "&" must be escaped in normal text
(as "&lt;", "&gt;" and "&amp;"), since they have a special meaning in
HTML. So if you ever see any of those three unescaped in a document,
they are part of tags or escape sequences.
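
For what it is worth, the escaping itself is easy to try out with
Python's standard html module. This is only a sketch of the rule, and
escape_text is a made-up helper name.

```python
import html


def escape_text(text):
    # quote=False: only "&", "<" and ">" are replaced; quotes are kept.
    return html.escape(text, quote=False)


print(escape_text("a < b & c > d"))             # a &lt; b &amp; c &gt; d
print("<!--" + escape_text(">third<") + "-->")  # <!--&gt;third&lt;-->
print(html.unescape("&lt;third&gt;"))           # <third>
```
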




