What to do for HTML comment processing ???

Nick Simicich njs at scifi.squawk.com
Tue Mar 4 16:01:38 CET 2003


Bogofilter should do only what IE does.  Forget anything else; it is beyond 
stupid.  I'm sorry, but it does not matter what the spammers are doing; it 
matters how it is rendered.  Understand?  Those things are not comments, no 
matter what the standard calls them, because the browser does not treat them 
as comments.

Do you understand "unknown tags"?  All browsers understand unknown tags, 
and basically they ignore them.  IT IS NOT A COMMENT AND SHOULD NOT BE 
TREATED AS ONE!!!!!!!!  It should be handled by tag processing.  There are 
other tags that start <! and end >.  Unknown tags do not cause eyespace 
breaks, so they should be removed from inside tokens unless the sequence 
also contains whitespace or a tag that does cause an eyespace break.
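
As a rough illustration of that rule (a sketch only, in Python for brevity, 
and not bogofilter's actual lexer): tags are deleted without leaving a gap, 
except that a tag which causes an eyespace break leaves a single space.  The 
names BREAK_TAGS and strip_tags are made up for this example, the list of 
break-causing tags is an assumption, "/" is added to the start set so that 
closing tags are covered, and <!-- --> comments are assumed to have been 
dealt with beforehand.

```python
import re

# Assumption: a small, illustrative set of tags that cause a visible break.
BREAK_TAGS = {"br", "p", "div", "td", "tr", "li", "hr"}

# A tag starts with "<" plus a letter, "!" or "/" and runs to the next ">".
TAG_RE = re.compile(r"<([a-zA-Z!/][^>]*)>")

def strip_tags(html: str) -> str:
    """Delete tags; only break-causing tags leave a space behind."""
    def repl(match: re.Match) -> str:
        body = match.group(1)
        # Tag name without a leading "/"; empty for things like "<!x>".
        name = re.match(r"/?([a-zA-Z0-9]*)", body).group(1).lower()
        return " " if name in BREAK_TAGS else ""
    return TAG_RE.sub(repl, html)
```

With this, "vi<!x>ag<zzz>ra" collapses to the single token "viagra", while 
"buy<br>now" stays two words.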

<! anything >

is an unknown tag, not a comment to IE.

<!-- --> is a comment.  It only starts specifically <!-- and ends 
specifically -->.  Period.  If you do ANYTHING ELSE, YOU HAVE BROKEN 
BOGOFILTER!!!!!
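
A minimal sketch of that comment rule, again in Python and with made-up 
names (COMMENT_RE, strip_comments): only text that starts with <!-- and 
runs to the first --> is dropped as a comment.  Leaving an unterminated 
<!-- untouched is an assumption of the sketch, not something stated here 
about IE.

```python
import re

# A comment starts with "<!--" and ends at the first "-->", nothing else.
# DOTALL lets a comment span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(html: str) -> str:
    """Remove <!-- ... --> comments; leave <! anything > for tag handling."""
    return COMMENT_RE.sub("", html)
```

A ">" inside the comment does not end it, so "a<!-- x > y -->b" becomes 
"ab", while "a<! x >b" is passed through unchanged for the tag processing 
to deal with.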

At 12:51 PM 2003-02-28 -0500, David Relson wrote:

>Hi,
>
>Two things today...
>
>First, bogofilter has options for turning off the killing of html 
>comments.  If there are no objections, I am going to remove the options, 
>which means that bogofilter will _always_ kill html comments (in html text).
>
>Second, it has been suggested that bogofilter be more aggressive in its 
>handling of html comments.  According to the standard, a non-empty comment 
>looks like "<!--comment-->" with white space allowed before/after the 
>pairs of dashes.  The "comment" part (between the pairs of dashes) can 
>include most anything.  In particular it can include angle brackets, but 
>not pairs of dashes.  You can have multiple comments inside the "<!" and 
>">" delimiters, as in "<! -- comment one -- -- <this comment has angle 
>brackets> -- >"

The standard is ignored.  <! anything > is not a comment, so it can end >.
<!-- is a comment and can only end -->.   Please do not get confused by 
isolated examples of what spammers send out.  The standard is not 
implemented; it seems to have been completely ignored for the sake of 
backward compatibility.

>A while back spam without the trailing dashes was reported and bogofilter 
>was modified to consider that as a valid comment and discard it.  Now 
>there's spam without the leading dashes.

Those are not comments according to most browsers.  Treat them as unknown 
tags.

>Should bogofilter simply forget that dashes are in the html standard and 
>treat "<!whatever is inside the angle brackets>" as a comment?

No.  The browsers do not, so bogofilter should not, no matter how people have 
voted.  There is only one "right" thing to do here, and that is to do what 
IE does (and, coincidentally, Netscape, although that hardly matters).

>Probably the lexer can be written so that proper comments (with 
>leading/trailing dashes) are recognized as well as comments without any 
>dashes.  The proper treatment of "<!-- proper start, improper end>" 
>remains a question.

BROKEN!!!!  No, there is a "proper treatment".  Do what IE does.  <!-- only 
ends -->.  <anything else ends >.  If you end a <!-- comment on a > as you 
do now, you are not doing what the browser does.

>Comments requested ...

A "naked" < is rendered.  <> is rendered.  <[a-zA-Z!] is a tag and ends >, 
unless it starts <!--; <!-- is a comment and ends -->.  I do not know if 
there are other characters in the set [a-zA-Z!].
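
Those four cases can be sketched as a small classifier (Python, illustrative 
only; TOKEN_RE and classify are hypothetical names).  It labels each piece 
of the input as text, tag, or comment.  Treating an unterminated <!-- as 
plain text is an assumption of the sketch.

```python
import re

TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)"           # <!-- ... -->, ends only at -->
    r"|(?P<tag><(?!!--)[a-zA-Z!][^>]*>)"  # <x ...> or <! ...>, ends at >
    r"|(?P<text>[^<]+|<)",                # anything else, incl. a naked <
    re.DOTALL,
)

def classify(html: str) -> list[tuple[str, str]]:
    """Split html into (kind, text) pieces: 'text', 'tag' or 'comment'."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(html)]
```

Here "<" and "<>" come back as text (rendered), "<b>" and "<! anything >" 
as tags, and "<!-- a > b -->" as one comment.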

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


