What to do for HTML comment processing ???

David Relson relson at osagesoftware.com
Fri Feb 28 18:51:12 CET 2003


Hi,

Two things today...

First, bogofilter has options for turning off the killing of html 
comments.  If there are no objections, I am going to remove the options, which 
means that bogofilter will _always_ kill html comments (in html text).

Second, it has been suggested that bogofilter be more aggressive in its 
handling of html comments.  According to the standard, a non-empty comment 
looks like "<!--comment-->" with white space allowed before/after the pairs 
of dashes.  The "comment" part (between the pairs of dashes) can include 
most anything.  In particular it can include angle brackets, but not pairs 
of dashes.  You can have multiple comments inside the "<!" and ">" 
delimiters, as in "<! -- comment one -- -- <this comment has angle 
brackets> -- >"
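
To make the strict form concrete, here is a minimal Python sketch of a matcher for the comment syntax as described above. It is only an illustration, not bogofilter's lexer (which is not written in Python); the names STRICT_COMMENT and is_comment_declaration are made up for this example, and the pattern follows the description in this message rather than the exact text of the standard.

```python
import re

# Hypothetical illustration only; not bogofilter code.
# "<!", optional white space, zero or more "--...--" comments (each one
# optionally followed by white space), then ">".  The comment body may
# contain angle brackets but never a pair of dashes.
STRICT_COMMENT = re.compile(r"<!\s*(?:--(?:(?!--).)*--\s*)*>", re.DOTALL)


def is_comment_declaration(text):
    """Return True if the whole string is one well-formed comment declaration."""
    return STRICT_COMMENT.fullmatch(text) is not None
```

Under these assumptions both quoted examples above are accepted, while a string that is missing its leading or trailing dashes is rejected, which is exactly the gap the spam is exploiting.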

A while back spam without the trailing dashes was reported and bogofilter 
was modified to consider that as a valid comment and discard it.  Now 
there's spam without the leading dashes.

Should bogofilter simply forget that dashes are in the html standard and 
treat "<!whatever is inside the angle brackets>" as a comment?

Probably the lexer can be written so that proper comments (with 
leading/trailing dashes) are recognized as well as comments without any 
dashes.  The proper treatment of "<!-- proper start, improper end>" remains 
a question.
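
As a rough sketch of what such a two-tier rule could look like, the Python below tries the proper form first and falls back to "anything between <! and the next >". This is an illustration under stated assumptions, not a proposal for the actual lexer: the names PROPER, LOOSE and strip_comments are invented here, and the handling of "<!-- proper start, improper end>" (cut at the first ">") is just one possible answer to the open question.

```python
import re

# Hypothetical illustration only; not bogofilter code.
# Tier 1: one or more proper "--...--" comments between "<!" and ">".
PROPER = r"<!\s*(?:--(?:(?!--).)*--\s*)+>"
# Tier 2 (assumed policy): anything from "<!" up to the first ">".
LOOSE = r"<![^>]*>"

# The proper form is listed first, so it wins whenever it matches at a
# given position; the loose form is only used as a fallback.
COMMENT = re.compile(PROPER + "|" + LOOSE, re.DOTALL)


def strip_comments(html):
    """Remove proper comments and dash-less "<!...>" pseudo-comments."""
    return COMMENT.sub("", html)
```

With this sketch, a proper comment may contain angle brackets and is removed whole, whereas a comment with a proper start and improper end is cut at the first ">", so anything after an embedded ">" would survive as text. The loose tier would also swallow declarations such as "<!DOCTYPE ...>", which may or may not be wanted.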

Comments requested ...

David




