html comment processing
David Relson
relson at osagesoftware.com
Tue Apr 1 00:06:30 CEST 2003
At 04:56 PM 3/31/03, Emmanuel Seyman wrote:
> > <br>one tw<!--this is a comment-->o three
>
>The comment is "--this is a comment--" .
>
> > <br>single dou<!--this is a comment-->ble triple
>
>Same here.
>
> > <br><!first> <!--second--> <!-->third<-->
>
>"<!first>" is a comment declaration with data characters inside
>but no comment.
>
>The second comment declaration contains the comment "--second--".
>
> > <!-->third<-->
>
>Again "<!-->" is a comment declaration with data characters inside.
>"third" is part of the text. It needs to be counted.
>
>"<-->" is an illegal tag. To be ignored.
>
> > <br>Please vis<! FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<!
> > FF3FFi?FS$s0,sz>te
> > <br>Please vis<!-- FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<!
> > FF3FFi?FS$s0,sz>te
>
>All six comment declarations contain data characters but no comments.
>
>Emmanuel
Emmanuel,
At the moment, bogofilter is discarding html comments. I'm more interested
in what a person will see in the message, i.e. what's left after the
comments are removed. I expect my sample to yield (roughly):
one two three
single double triple
first second third
Please visit our web site
Please visit our web site
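
To illustrate the "what's left after the comments are removed" idea, here is a minimal Python sketch. It is not bogofilter's code (bogofilter is written in C). It assumes that a comment declaration is everything from "<!" up to the next ">", and that any other "<...>" run is a tag to be dropped. The name visible_text is hypothetical. The shortcut misparses a comment whose body contains ">".

```python
import re

# Assumption: a declaration runs from "<!" to the next ">".
# A comment containing ">" in its body would be cut short by this pattern.
DECLARATION = re.compile(r"<![^>]*>")

# Assumption: any other "<...>" run is a tag (valid or not) and is dropped.
TAG = re.compile(r"<[^>]*>")


def visible_text(html: str) -> str:
    """Return roughly what a reader would see once declarations and tags are removed."""
    without_declarations = DECLARATION.sub("", html)
    without_tags = TAG.sub("", without_declarations)
    # Collapse runs of whitespace (including line breaks) into single spaces.
    return " ".join(without_tags.split())


if __name__ == "__main__":
    print(visible_text("<br>one tw<!--this is a comment-->o three"))
    print(visible_text("<br>single dou<!--this is a comment-->ble triple"))
```

Because the declarations are deleted rather than replaced by a space, the fragments on either side are joined back together, which is what defeats the word-splitting trick in the samples.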
At a future date, bogofilter will have options to allow scoring of html
comments and the innards of html tags. Possibly there will be separate
options for valid html tags and invalid ones (such as the garbage spammers
use).
David