What to do for HTML comment processing ???
Nick Simicich
njs at scifi.squawk.com
Tue Mar 4 16:01:38 CET 2003
Bogofilter should do only what IE does. Forget anything else, it is beyond
stupid. I'm sorry, but it does not matter what the spammers are doing, it
matters how it is rendered. Understand? Those things are not comments, no
matter what the standard calls them because the browser does not treat them
as comments.
Do you understand "unknown tags"? All browsers understand unknown tags,
and basically they ignore them. IT IS NOT A COMMENT AND SHOULD NOT BE
TREATED AS ONE!!!!!!!! It should be handled by tag processing. There are
other tags that start <! and end >. Unknown tags do not cause eyespace
breaks, so they should be removed from inside tokens unless the sequence
contains a tag that causes an eyespace break, or whitespace.
<! anything >
is an unknown tag, not a comment to IE.
<!-- --> is a comment. It only starts specifically <!-- and ends
specifically -->. Period. If you do ANYTHING ELSE, YOU HAVE BROKEN
BOGOFILTER!!!!!
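
To make the distinction concrete, here is a minimal Python sketch of the rule as stated above. It is an illustration only, not bogofilter's lexer: the name classify_bang is hypothetical, and running an unterminated construct to the end of the input is an assumption, since the post does not cover that case.

```python
def classify_bang(s):
    """Classify a string that starts with '<!' per the rule in this post.

    Returns (kind, construct), where construct is the exact span a
    browser-like parser would swallow.
    """
    if s.startswith("<!--"):
        # A real comment: only "-->" can close it, so a bare ">" inside
        # the comment does not end it.
        end = s.find("-->", 4)
        kind, stop = "comment", end + 3
    elif s.startswith("<!"):
        # "<! anything >" is an unknown tag: the first ">" closes it.
        end = s.find(">", 2)
        kind, stop = "unknown-tag", end + 1
    else:
        return ("text", "")
    if end == -1:
        # Assumption (not from the post): an unterminated construct
        # runs to the end of the input.
        stop = len(s)
    return (kind, s[:stop])
```

Under these rules "<!-- a > b --> tail" is a comment spanning "<!-- a > b -->", while "<!- x -> y" is an unknown tag that ends at the first ">".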
At 12:51 PM 2003-02-28 -0500, David Relson wrote:
>Hi,
>
>Two things today...
>
>First, bogofilter has options for turning off the killing of html
>comments. If there are no objections, I am going to remove the options,
>which means that bogofilter will _always_ kill html comments (in html text).
>
>Second, it has been suggested that bogofilter be more aggressive in its
>handling of html comments. According to the standard, a non-empty comment
>looks like "<!--comment-->" with white space allowed before/after the
>pairs of dashes. The "comment" part (between the pairs of dashes) can
>include most anything. In particular it can include angle brackets, but
>not pairs of dashes. You can have multiple comments inside the "<!" and
>">" delimiters, as in "<! -- comment one -- -- <this comment has angle
>brackets> -- >"
The standard is ignored. <! anything > is not a comment, so it can end >.
<!-- is a comment and can only end -->. Please do not get confused by
isolated examples of what spammers send out. The standard is not
implemented, it seems to have been completely ignored because of downward
compatibility.
>A while back spam without the trailing dashes was reported and bogofilter
>was modified to consider that as a valid comment and discard it. Now
>there's spam without the leading dashes.
Those are not comments according to most browsers. Treat them as unknown
tags.
>Should bogofilter simply forget that dashes are in the html standard and
>treat "<!whatever is inside the angle brackets>" as a comment?
No. The browsers do not, bogofilter should not, no matter how people have
voted. There is only one "right" thing to do here and that is to do what
IE (and coincidentally Netscape, although that hardly matters) does.
>Probably the lexer can be written so that proper comments (with
>leading/trailing dashes) are recognized as well as comments without any
>dashes. The proper treatment of "<!-- proper start, improper end>"
>remains a question.
BROKEN!!!! No, there is a "proper treatment". Do what IE does. <!-- only
ends -->. <anything else ends >. If you end a <!-- comment on a > as you
do now, you are not doing what the browser does.
>Comments requested ...
A "naked" < is rendered. <> is rendered. <[a-zA-Z!], unless it is <!--, is
a tag and ends >. <!-- is a comment and ends -->. I do not know if there
are other characters in the set [a-zA-Z!].
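
The four rules above can be sketched as a small scanner. This is a rough Python illustration under stated assumptions, not bogofilter's lexer and not a verified model of IE: visible_text, TAG_START and BREAK_TAGS are hypothetical names, "/" is added to the tag-start set so that closing tags such as </b> are covered, the list of tags that cause an eyespace break is a made-up sample, and an unterminated tag is simply rendered as a literal <.

```python
import string

# Characters that may follow "<" to start a tag. The post gives
# [a-zA-Z!]; "/" is an addition here so closing tags are handled.
TAG_START = set(string.ascii_letters) | {"!", "/"}

# Hypothetical sample of tags that cause an eyespace break.
BREAK_TAGS = {"br", "p", "div"}


def visible_text(html):
    """Return roughly what a reader would see, per the rules above."""
    out = []
    i, n = 0, len(html)
    while i < n:
        if html[i] != "<":
            out.append(html[i])
            i += 1
        elif html.startswith("<!--", i):
            # Comment: only "-->" ends it; unterminated swallows the rest.
            end = html.find("-->", i + 4)
            i = n if end == -1 else end + 3
        elif i + 1 < n and html[i + 1] in TAG_START:
            end = html.find(">", i + 1)
            if end == -1:
                # Assumption: an unterminated tag is rendered as text.
                out.append("<")
                i += 1
                continue
            words = html[i + 1:end].strip("/ ").split()
            name = words[0].lower() if words else ""
            if name in BREAK_TAGS:
                out.append(" ")  # eyespace break splits the token
            # Any other tag, known or unknown, vanishes without a
            # break, so the text on either side joins into one token.
            i = end + 1
        else:
            # A naked "<" (including the "<" of "<>") is rendered.
            out.append("<")
            i += 1
    return "".join(out)
```

With this sketch, visible_text("vi<!x>ag<!-- > -->ra") yields "viagra", and "a < b <> c" comes back unchanged.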
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!