html comment processing

David Relson relson at osagesoftware.com
Tue Apr 1 00:06:30 CEST 2003


At 04:56 PM 3/31/03, Emmanuel Seyman wrote:

> > <br>one tw<!--this is a comment-->o three
>
>The comment is "--this is a comment--".
>
> > <br>single dou<!--this is a comment-->ble triple
>
>Same here.
>
> > <br><!first> <!--second--> <!-->third<-->
>
>"<!first>" is a comment declaration with data characters inside
>but no comment.
>
>The second comment declaration contains the comment "--second--".
>
> > <!-->third<-->
>
>Again "<!-->" is a comment declaration with data characters inside.
>"third" is part of the text. It needs to be counted.
>
>"<-->" is an illegal tag. To be ignored.
>
> > <br>Please vis<! FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<!
> > FF3FFi?FS$s0,sz>te
> > <br>Please vis<!-- FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<!
> > FF3FFi?FS$s0,sz>te
>
>All six comment declarations contain data characters but no comments.
>
>Emmanuel

Emmanuel,

At the moment, bogofilter is discarding html comments.  I'm more interested
in what a person will see in the message, i.e. what's left after the
comments are removed.  I expect my sample to yield (roughly):

one two three
single double triple
first second third
Please visit our web site
Please visit our web site
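
As a rough illustration of the kind of stripping described above, here is a
minimal Python sketch. It is not bogofilter's actual lexer; the helper name
visible_text and the rule "drop everything from '<' to the next '>'" are
assumptions made for this example only.

```python
import re

# Assumption for this sketch: anything from "<" up to the next ">" is markup.
# That covers ordinary tags, comment declarations such as "<!first>", and
# malformed tags such as "<-->". It is not a conforming SGML/HTML parser.
MARKUP = re.compile(r"<[^>]*>")


def visible_text(html: str) -> str:
    """Return the text left once markup is dropped (hypothetical helper)."""
    stripped = MARKUP.sub("", html)
    # Collapse runs of whitespace, including line breaks inside the input.
    return " ".join(stripped.split())


samples = [
    "<br>one tw<!--this is a comment-->o three",
    "<br>single dou<!--this is a comment-->ble triple",
    "<br><!first> <!--second--> <!-->third<-->",
    "<br>Please vis<! FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<!\n"
    "FF3FFi?FS$s0,sz>te",
    "<br>Please vis<!-- FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<!\n"
    "FF3FFi?FS$s0,sz>te",
]

for sample in samples:
    print(visible_text(sample))
```

Under these assumptions the third sample comes out as just "third", which
matches Emmanuel's reading rather than the rough list above, and the last two
samples come out as "Please visit our website" with no space, since the
declarations are removed without leaving a gap.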

At a future date, bogofilter will have options to allow scoring of html 
comments and the innards of html tags.  Possibly there will be separate 
options for valid html tags and invalid ones (such as the garbage spammers 
use).

David




