html comment processing

David Relson relson at osagesoftware.com
Mon Mar 31 20:06:53 CEST 2003


At 11:21 AM 3/31/03, Herman Oosthuysen wrote:

>Well, read this snippet from the spec again, slowly:
>
>"To include comments in an HTML document, use a comment declaration. A 
>comment declaration consists of `<!' followed by zero or more comments 
>followed by `>'. Each comment starts with `--' and includes all text up to 
>and including the next occurrence of `--'."
>
>What it says to me is that the comment delimiters are <! and >, and that 
>the comment should start and end with --, but that the -- is part of the 
>comment.
>
>Therefore, the -- is optional and is only of concern if you wish to 
>actually recover the comment itself, which bogofilter probably doesn't 
>need to do.

Herman,

I _like_ your interpretation!  It fits well with what we actually 
see.  However, I don't think the purists would agree with you.

Personally, I find the wording odd and hard to understand.  Having the 
"comment declaration" separate from the comment allows "<!>" to be used as 
an empty comment - but don't ask me how that's useful.  Having "--" start 
and end each comment also has its value: it allows angle brackets to be 
included.  I do believe that "<!-->whatever<-->" contains a 10-character 
comment, ">whatever<".
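
To make the strict reading concrete, here is a minimal Python sketch of a 
parser for one comment declaration as the quoted spec text describes it.  It 
is only an illustration, not bogofilter's code: the function name 
parse_comment_declaration is made up, and allowing whitespace after a 
comment (but not right after "<!") is my assumption about the rule.

```python
def parse_comment_declaration(text, start=0):
    """Parse one strict comment declaration beginning at text[start].

    Returns (comments, end): the list of comment bodies and the index just
    past the closing '>'.  Returns None if the text is not a well-formed
    declaration under the strict reading.
    """
    if not text.startswith("<!", start):
        return None
    pos = start + 2
    comments = []
    # Zero or more comments, each delimited by "--" ... "--".
    while text.startswith("--", pos):
        close = text.find("--", pos + 2)
        if close == -1:
            return None  # comment opened but never closed
        comments.append(text[pos + 2:close])
        pos = close + 2
        # Assumption: whitespace may follow a comment's closing "--".
        while pos < len(text) and text[pos].isspace():
            pos += 1
    # The declaration must end with '>' right here.
    if text.startswith(">", pos):
        return comments, pos + 1
    return None


print(parse_comment_declaration("<!>"))
print(parse_comment_declaration("<!-->whatever<-->"))
print(parse_comment_declaration("<!first>"))
```

Under this reading "<!>" yields no comments, "<!-->whatever<-->" yields the 
single comment ">whatever<", and "<!first>" is rejected as malformed.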

Below is a sample of html with embedded comments.  It can be interpreted as 
an email message or as html (though it isn't quite either one).  It's not 
clear to me what makes up the right set of tokens for bogofilter to extract 
from it.

<html><body>
Content-Type: text/html

<br>one tw<!--this is a comment-->o three
<br>single dou<!--this is a comment-->ble triple
<br><!first> <!--second--> <!-->third<-->
<br>Please vis<! FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<! 
FF3FFi?FS$s0,sz>te
<br>Please vis<!-- FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<! 
FF3FFi?FS$s0,sz>te
</body></html>
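
For comparison, here is a rough Python sketch of the lenient reading, where 
everything from "<!" up to the next ">" is treated as a comment and dropped. 
The name strip_lenient is hypothetical, and this is only an illustration of 
that interpretation, not what bogofilter actually does.

```python
import re

# Lenient reading: a comment runs from "<!" to the very next ">".
# [^>] also matches newlines, so a comment wrapped across lines is handled.
LENIENT_COMMENT = re.compile(r"<![^>]*>")


def strip_lenient(text):
    """Remove every "<!...>" span, ignoring any "--" delimiters inside."""
    return LENIENT_COMMENT.sub("", text)


print(strip_lenient("one tw<!--this is a comment-->o three"))
print(strip_lenient("Please vis<! FF3FFi?FS$s0,sz>it our web"
                    "<! FF3FFi?FS$s0,sz>si<! \nFF3FFi?FS$s0,sz>te"))
print(strip_lenient("<!first> <!--second--> <!-->third<-->"))
```

The first two inputs come out as "one two three" and "Please visit our 
website".  The third shows where the two readings part ways: the lenient 
version stops at the first ">" and leaves "third<-->" behind as text, while 
the strict reading treats ">third<" as a comment.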

Enjoy!

David




