html comment processing
David Relson
relson at osagesoftware.com
Tue Apr 1 01:09:04 CEST 2003
At 06:06 PM 3/31/03, Herman Oosthuysen wrote:
>>> > > <!-->third<-->
>>> >
>>> > Again "<!-->" is a comment declaration with data characters inside.
>>> > "third" is part of the text. It needs to be counted.
>>>
>>>*sigh*
>>>
>>>I knew this was going to happen.
>>>">third<" is the comment. That is one valid comment declaration tag.
>>
>>This may be bad html. Better form would be to escape the inner angle
>>brackets, i.e.
>><!--&gt;third&lt;-->
>Yep, if you want "--<third>--" to be a comment, then you have to escape
>the angle brackets to "--&gt;third&lt;--".
>
>If "<!-->third<-->" is to be parsed literally, then
>"<!-->" is the comment "--",
>"third" is text and
>"<-->" is an illegal tag.
I beg to differ. Comments call for two pairs of hyphens. "<!-->" is
nothing described by the spec.
I _will_ concede that the "...third..." is not valid html. However
bogofilter should do something reasonable with it. At the moment, using
strict_check=no it finds one token, i.e. the word "third".
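To make the two readings in this thread concrete, here is a minimal Python sketch. It is not bogofilter's lexer; the regular expressions and the function names tokens_two_pairs and tokens_literal are made up purely for illustration.

```python
import re

# Reading A (assumption): a comment needs two pairs of hyphens, so it runs
# from "<!--" to the next "-->" that begins after the opening delimiter.
COMMENT_TWO_PAIRS = re.compile(r"<!--.*?-->", re.DOTALL)

# Reading B (assumption): anything from "<" to the next ">" is markup and is
# dropped, so "<!-->" and "<-->" are each treated as a complete tag.
TAG = re.compile(r"<[^>]*>")

# Hypothetical token rule: runs of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")


def tokens_two_pairs(text):
    """Strip two-pair comments first, then remaining tags, then tokenize."""
    stripped = COMMENT_TWO_PAIRS.sub(" ", text)
    stripped = TAG.sub(" ", stripped)
    return WORD.findall(stripped)


def tokens_literal(text):
    """Strip every "<...>" run up to the first ">", then tokenize."""
    stripped = TAG.sub(" ", text)
    return WORD.findall(stripped)


sample = "<!-->third<-->"
print(tokens_two_pairs(sample))
print(tokens_literal(sample))
```

On "<!-->third<-->" the first reading swallows the whole string as one comment and yields no tokens, while the second yields the single word "third". On a well-formed comment such as "one <!-- two --> three" both readings agree.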
>Following some offline discussions, note this typedef tag at the start of
>every HTML doc:
><!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">
>
>Bogofilter should discard the above typedef construct as a comment.
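A rough Python sketch of discarding such a declaration before tokenizing follows. The pattern and the name drop_declarations are hypothetical, and the sketch assumes the declaration contains no ">" inside its quoted string, as in the example above.

```python
import re

# Hypothetical pattern: "<!" followed by a letter, up to the first ">".
# Requiring a letter after "<!" leaves "<!-- ... -->" comments untouched.
DECLARATION = re.compile(r"<![A-Za-z][^>]*>")


def drop_declarations(text):
    """Remove DOCTYPE-style declarations so their contents yield no tokens."""
    return DECLARATION.sub("", text)


doc = '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">Hello'
print(drop_declarations(doc))
```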
>
>The three characters "<", ">" and "&" must be escaped in normal text, since they
>have a special meaning in HTML, so if you ever see those three in a
>document, then they are part of tags/escape sequences.
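As an illustration of that escaping rule, Python's standard html module can do the round trip. This is only a sketch of the rule itself, not of anything bogofilter does; the helper names escape_text and unescape_text are invented here.

```python
import html


def escape_text(text):
    """Replace "&", "<" and ">" with their entity forms; quotes are left alone."""
    return html.escape(text, quote=False)


def unescape_text(text):
    """Turn entity references such as "&lt;" back into literal characters."""
    return html.unescape(text)


raw = "<!-->third<-->"
escaped = escape_text(raw)
print(escaped)                  # &lt;!--&gt;third&lt;--&gt;
print(unescape_text(escaped))   # <!-->third<-->
```

Once escaped, the string contains no literal "<" or ">", so a parser sees it as plain text rather than as tags or comment delimiters.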