html comment processing

David Relson relson at osagesoftware.com
Tue Apr 1 01:09:04 CEST 2003


At 06:06 PM 3/31/03, Herman Oosthuysen wrote:

>>> > > <!-->third<-->
>>> >
>>> > Again "<!-->" is a comment declaration with data characters inside.
>>> > "third" is part of the text. It needs to be counted.
>>>
>>>*sigh*
>>>
>>>I knew this was going to happen.
>>>">third<" is the comment. That is one valid comment declaration tag.
>>
>>This may be bad html.  Better form would be to escape the inner angle 
>>brackets, i.e.
>><!--&gt;third&lt;-->
>Yep, if you want "--<third>--" to be a comment, then you have to escape 
>the angle brackets to "--&gt;third&lt;--".
>
>If "<!-->third<-->" is to be parsed literally, then
>"<!-->" is the comment "--",
>"third" is text and
>"<-->" is an illegal tag.

I beg to differ.  Comments call for two pairs of hyphens.  "<!-->" is 
nothing described by the spec.

I _will_ concede that the "...third..." is not valid html.  However 
bogofilter should do something reasonable with it.  At the moment, using 
strict_check=no it finds one token, i.e. the word "third".
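
To make the two readings in this thread concrete, here is a minimal Python sketch. It is not bogofilter's lexer, and it does not model what strict_check does internally; the names tokens_strict and tokens_lenient are hypothetical, chosen only for illustration. The strict version assumes the SGML rule that "<!" opens a declaration, each comment inside it is delimited by a pair of "--", and ">" closes it. The lenient version simply treats everything from "<" to the next ">" as markup.

```python
import re

# Strict SGML reading (assumption): "<!" opens a declaration, every comment
# inside it is delimited by a pair of "--", and ">" closes the declaration.
SGML_COMMENT_DECL = re.compile(r"<!(?:--.*?--\s*)*>", re.DOTALL)

# Lenient reading (assumption): anything from "<" to the next ">" is markup.
ANY_TAG = re.compile(r"<[^>]*>")

# A deliberately crude notion of "token": a run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")


def tokens_strict(text):
    """Drop SGML comment declarations only, then collect words."""
    return WORD.findall(SGML_COMMENT_DECL.sub(" ", text))


def tokens_lenient(text):
    """Drop every "<...>" span, then collect words."""
    return WORD.findall(ANY_TAG.sub(" ", text))


sample = "<!-->third<-->"
print(tokens_strict(sample))   # [] : ">third<" is swallowed as comment text
print(tokens_lenient(sample))  # ['third'] : "<!-->" and "<-->" are dropped
```

Under the strict reading the whole string is a single comment declaration, so no token survives. Under the lenient reading the word "third" is left over, which is the same single token reported above for strict_check=no.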

>As from some offline discussions, note this typedef tag at the start of 
>every HTML doc:
><!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">
>
>Bogofilter should discard the above typedef construct as a comment.
>
>The three characters "<>&" must be escaped in normal text, since they 
>have a special meaning in HTML, so if you ever see those three in a 
>document, then they are part of tags/escape sequences.
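
As a small illustration of the escaping rule quoted above, the following Python sketch uses the standard library's html.escape to produce the entity forms. The helper name make_comment is hypothetical, and whether a given filter decodes entities inside comments is not something this snippet claims to show.

```python
import html


def make_comment(inner):
    """Wrap text in comment delimiters after escaping "<", ">" and "&"."""
    return "<!--" + html.escape(inner) + "-->"


print(html.escape("<>&"))       # &lt;&gt;&amp;
print(make_comment(">third<"))  # <!--&gt;third&lt;-->
```

With the inner brackets escaped, the comment body no longer contains a bare ">" or "<", so the strict and the lenient readings agree on where the comment ends.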