html comment processing

Sun Mar 30 03:57:50 CEST 2003

Greetings,

Bogofilter has understood html since version 0.10.0 in January and has had 
code for discarding html comments.  Over time, bogofilter has used two 
slightly different definitions to determine what to discard.  Initially it 
defined comments as "<!--whatever-->", which roughly matches the official 
definition.  Some spam didn't include the second pair of hyphens.  This 
caused enough trouble that bogofilter's definition was changed to 
"<!--whatever>" (for 0.10.3.1, the previous stable version).  For 0.11.0 
bogofilter returned to the strict definition.  Recently people have been 
receiving spam with constructs like "Please vis<! FF3FFi?FS$s0,sz>it our 
web<! FF3FFi?FS$s0,sz>si<! FF3FFi?FS$s0,sz>te".  Bogofilter handles this 
poorly.

The question at hand is "How should bogofilter define html comments?"

W3C defines html comments at 
http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5 as:

"To include comments in an HTML document, use a comment declaration. A 
comment declaration consists of `<!' followed by zero or more comments 
followed by `>'. Each comment starts with `--' and includes all text up to 
and including the next occurrence of `--'. In a comment declaration, white 
space is allowed after each comment, but not before the first comment. The 
entire comment declaration is ignored."
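
To make the quoted rule concrete, here is a minimal sketch of a strict 
matcher.  The function name comment_decl_len is hypothetical and is not 
taken from bogofilter's html.c; it only illustrates the rule as quoted: 
"<!", then zero or more "--" ... "--" comments, each optionally followed 
by white space, then ">".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of bogofilter).
 * Returns the length of the comment declaration that starts at s,
 * or 0 if s does not start a well-formed comment declaration. */
size_t comment_decl_len(const char *s)
{
    const char *p;

    assert(s != NULL);

    /* a comment declaration must open with "<!" */
    if (strncmp(s, "<!", 2) != 0)
        return 0;
    p = s + 2;

    /* zero or more comments; white space is allowed after each one,
     * but not before the first */
    while (strncmp(p, "--", 2) == 0) {
        const char *close = strstr(p + 2, "--");
        if (close == NULL)
            return 0;               /* comment is never terminated */
        p = close + 2;
        while (isspace((unsigned char)*p))
            p++;
    }

    /* the declaration must now close with '>' */
    if (*p != '>')
        return 0;
    return (size_t)(p - s) + 1;
}
```

Under this sketch "<!--whatever-->" is accepted (length 15), while 
"<!--whatever>" and "<! FF3FFi?FS$s0,sz>" are both rejected (0).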

This corresponds quite closely to bogofilter's current (0.11.1.5) 
definition, which matches neither the second sample above 
("<!--whatever>") nor the third ("<! FF3FFi?FS$s0,sz>").  Since bogofilter 
has to live in the real world, it should process 
html comments so as to best recognize the text in the message.  The current 
practice of spammers is to use the hyphens, but only sometimes.  Bogofilter 
should be able to process comments whether or not they have the hyphens.

A quick browser check indicates that the double hyphens are totally 
ignored, i.e. "<!whatever>" is treated as a comment.  Bogofilter's default 
mode should be to duplicate this behavior.  For those who want to 
experiment, a patch is attached.
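
As a rough illustration of the browser-like behavior, the sketch below 
drops everything from "<!" up to the next ">".  It is a simplified 
stand-in, not bogofilter's kill_html_comment(): the name 
strip_lenient_comments is hypothetical, and it assumes the whole text is 
in one NUL-terminated buffer, whereas the real lexer works on a buffer 
that is refilled as it goes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of bogofilter).
 * Removes every "<!" ... ">" span from text, in place, and returns the
 * new length.  A "<!" with no closing '>' is left untouched. */
size_t strip_lenient_comments(char *text)
{
    char *src, *dst;

    assert(text != NULL);

    src = dst = text;
    while (*src != '\0') {
        if (src[0] == '<' && src[1] == '!') {
            char *end = strchr(src + 2, '>');
            if (end != NULL) {
                src = end + 1;      /* skip the whole comment */
                continue;
            }
            /* no closing '>': fall through and copy the text as is */
        }
        *dst++ = *src++;
    }
    *dst = '\0';
    return (size_t)(dst - text);
}
```

Applied to "Please vis<! FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<! 
FF3FFi?FS$s0,sz>te" (as one line), this leaves "Please visit our website", 
which is the text the tokenizer needs to see.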

Also of note, today there has been a discussion titled "It's getting 
worse", which is about spam with html comments lacking hyphens.  The 
"Please vis<!..>it our ..." sample is from a message in the 
discussion.  The attached patch fixes that problem as well.

For the html purists, I propose to add a config file option named 
"strict_comment".  A value of "true" will cause bogofilter to follow the 
standard and a value of "false" will work as described above.  The default 
value will be "false".

David
-------------- next part --------------
Index: html.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/html.c,v
retrieving revision 1.12
diff -u -r1.12 html.c

--- html.c	10 Mar 2003 05:13:40 -0000	1.12
+++ html.c	30 Mar 2003 01:47:52 -0000
@@ -18,19 +18,18 @@
 #include "html.h"
 #include "lexer.h"
 
-/* Macro Definitions */
-
-#define	COMMENT_START	"<!--"
-#define	COMMENT_START_LEN	4	/* strlen(COMMENT_START) */
-
-#define	COMMENT_END	"-->"
-#define	COMMENT_END_LEN 	3	/* strlen(COMMENT_END) */
-
 /* Function Declarations */
 
 static int kill_html_comment(buff_t *buff, size_t comment_start);
 
-/* http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5
+bool strict_check = false;
+
+/* If strict_check is enabled, bogofilter will check for  "<!--" and "-->".
+** If strict_check is disabled, bogofilter will check for  "<!" and ">".
+**
+** The strict mode corresponds to the comment definition at:
+**
+** http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5
 **
 ** Comments:
 **
@@ -79,6 +78,11 @@
     bool done = false;
     byte *tmp = buf_beg;
 
+    const char *start = strict_check ? "<!--" : "<!";
+    const char *finish = strict_check ? "-->" : ">";
+    size_t start_len = strlen(start);
+    size_t finish_len = strlen(finish);
+
     while (!done) {
 	byte c;
 	size_t need;
@@ -94,7 +98,7 @@
 */
 	c = *tmp;
 
-	need = (c == '<') ? COMMENT_START_LEN : 1;
+	need = (c == '<') ? start_len : finish_len;
 
 	buf_used = buf_beg + buff->t.leng - comment_offset;
 	avail = buf_used - tmp;
@@ -114,24 +118,21 @@
 	{
 	    /* ensure buffer has sufficient characters for test */
 	    /* check for comment delimiter */
-	    if (memcmp(tmp, COMMENT_START, COMMENT_START_LEN) != 0)
+	    if (memcmp(tmp, start, start_len) != 0)
 		tmp += 1;
 	    else {
 		comment = tmp;
 		level += 1;
-		tmp += COMMENT_START_LEN;
+		tmp += start_len;
 	    }
 	    break;
 	}
 	case '>':
 	{
-	    /* Hack to only check for ">" rather than complete terminator "-->" */
-	    bool short_check = false;
 	    if (level == 0)
 		done = true;
 	    tmp += 1;
-	    if (level > 0 && (short_check || 
-			      memcmp(tmp - COMMENT_END_LEN, COMMENT_END, COMMENT_END_LEN) == 0))
+	    if (level > 0 && (memcmp(tmp - finish_len, finish, finish_len) == 0))
 	    {
 		/* eat comment */
 		buff_shift(buff, comment, tmp - comment);
@@ -155,7 +156,7 @@
 
 	/* When killing html comments, there's no need to keep it in memory */
 	if (comment != NULL && 
-	    buf_end - buf_used < COMMENT_END_LEN) 
+	    (size_t)(buf_end - buf_used) < finish_len)
 	{
 	    /* Leave enough to recognize the end of comment string. */
 	    size_t shift = tmp - comment;