html comment processing [was: database rebuild]

Nick Simicich njs at scifi.squawk.com
Sat Feb 8 22:53:44 CET 2003


At 12:55 PM 2003-02-08 -0500, David Relson wrote:

>Proper checking for the comment declaration strings ( "<!" and ">") is 
>more complex.  The initial implementation in bogofilter checked for "<!--" 
>and "-->", but we discovered that spammers sometimes used ">" to end the 
>comment.  To deal with this bogofilter changed.  At present I'm thinking 
>that bogofilter should do proper checking (by default) and have an option 
>for the other way, perhaps "relaxed_html_comments=true".

The reason I wrote the page and tried to get people to test it with various 
browsers was to learn what the browsers and renderers actually do with 
comments.  My assertion is that it does not matter what the spammers 
do.  If they format mail incorrectly, then people will not see it.  The 
question is, "What is the 'eyespace' that people work in?"

It looks like there are two major styles (now with Opera and w3m, three).

Style 1 is clearly the majority style, used by the Netscape/Mozilla/AOL 
renderer and the IE/Microsoft renderer: End the comment at the first -->, 
ignore everything else.

Style 2 is the Links/Lynx style, which is to generally end the comment at 
the first >.

Style 3 is the Opera/w3m style, which is to end the comment at the first 
--[[:space:]]*> pattern.
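
As a rough illustration of the three styles above, here is a minimal Python 
sketch.  The regular expressions and the names STYLE_PATTERNS and 
strip_comments are assumptions made up for this example; they are not taken 
from bogofilter or from any browser's source, and they only approximate the 
behaviour described.

```python
import re

# Hypothetical approximations of the three comment-ending rules.
# Each pattern assumes that a comment opens with "<!--".
STYLE_PATTERNS = {
    1: re.compile(r"<!--.*?-->", re.DOTALL),     # ends at the first "-->"
    2: re.compile(r"<!--.*?>", re.DOTALL),       # ends at the first ">"
    3: re.compile(r"<!--.*?--\s*>", re.DOTALL),  # ends at "--", optional whitespace, ">"
}


def strip_comments(html: str, style: int) -> str:
    """Return html with comments removed under the given style (1, 2 or 3)."""
    return STYLE_PATTERNS[style].sub("", html)


sample = "A<!-- x > B -- > C --> D"
for style in (1, 2, 3):
    print(style, repr(strip_comments(sample, style)))
```

On this sample each style leaves different text visible, so a filter and a 
renderer that follow different styles will disagree about what the reader 
sees.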

All three are buggy, by the way. Style 3 is the least buggy, but it is 
attackable. Right now, you are following style 2, which allows someone to 
craft an attack that will break your parsing and still render in 
IE/Mozilla.  I think that is a bad thing.
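
One way such an attack might look, as a hedged sketch in Python: the filter 
ends comments at the first ">" while the renderer ends them at the first 
"-->".  The message text, the regular expressions and the variable names are 
all hypothetical, chosen only to show the mismatch.

```python
import re

# Assumed approximations: the renderer ends a comment at the first "-->",
# the filter ends it at the first ">".
RENDERER_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
FILTER_COMMENT = re.compile(r"<!--.*?>", re.DOTALL)

# Hypothetical message: each comment contains an early ">" so the filter
# closes the comment too soon and leaves " -->" in the middle of the word.
message = "Buy che<!-- > -->ap pi<!-- > -->lls"

seen_by_reader = RENDERER_COMMENT.sub("", message)
seen_by_filter = FILTER_COMMENT.sub("", message)

print(repr(seen_by_reader))
print(repr(seen_by_filter))
```

The reader gets whole words, while the filter gets fragments broken up by 
leftover "-->" debris, so the tokens it scores are not the ones on screen.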

My personal feeling is that you should go with style 1, because it is, far 
and away, the most widely used.  If you have three pieces of mail that are 
broken, well, they are broken.

Has the code stabilized such that I could do the comment and tag migration, 
and where would I get it?

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on  confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


