[PATCH] consecutive html tags

David Relson relson at osagesoftware.com
Sun Feb 2 17:48:12 CET 2003


Nick,

I've looked at how kill_html_comments handles consecutive comments and
can confirm the problem you reported.  I also have a patch that corrects
it.  The patch, my test message, and the resulting output are below.

Please confirm the fix.

Thank you.

David

### here's the patch ###

Index: html.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/html.c,v
retrieving revision 1.9
diff -u -r1.9 html.c
--- html.c	29 Jan 2003 02:29:01 -0000	1.9
+++ html.c	2 Feb 2003 16:42:23 -0000
@@ -90,9 +90,11 @@
 		level -= 1;
 	    }
 	}
-	if (level == 0)
-	    break;
-	tmp += 1;
+	else {
+	    tmp += 1;
+	    if (level == 0)
+		break;
+	}
 	/* When killing html comments, there's no need to keep it in memory */
 	if (kill_html_comments && buf_end - buf_used < COMMENT_END_LEN) {
 	    /* Leave enough to recognize the end of comment string. */
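
For illustration only, here is a minimal standalone sketch of the behavior
the patch is after.  It is not the code in html.c: it strips comments from a
string and, right after a comment closes, loops again without consuming a
byte so that a comment starting immediately afterwards is still recognized.
COMMENT_END_LEN is the name html.c uses; strip_html_comments, COMMENT_START,
COMMENT_START_LEN and COMMENT_END are names assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Only COMMENT_END_LEN appears in html.c; the other names are assumed. */
#define COMMENT_START     "<!--"
#define COMMENT_START_LEN 4
#define COMMENT_END       "-->"
#define COMMENT_END_LEN   3

/* Copy `in` to `out`, dropping every <!-- ... --> span.
 * Illustrative sketch, not the bogofilter routine.
 * Output is truncated to fit `outsz`; returns the number of bytes written. */
size_t strip_html_comments(const char *in, char *out, size_t outsz)
{
    size_t used = 0;
    int in_comment = 0;

    assert(in != NULL && out != NULL && outsz > 0);

    while (*in != '\0') {
        if (!in_comment &&
            strncmp(in, COMMENT_START, COMMENT_START_LEN) == 0) {
            in_comment = 1;
            in += COMMENT_START_LEN;
        }
        else if (in_comment &&
                 strncmp(in, COMMENT_END, COMMENT_END_LEN) == 0) {
            in_comment = 0;
            in += COMMENT_END_LEN;
            /* Nothing is copied or skipped here: the next pass re-tests
             * this position, so a back-to-back comment is caught. */
        }
        else {
            /* Ordinary byte: keep it only when outside a comment. */
            if (!in_comment && used + 1 < outsz)
                out[used++] = *in;
            in += 1;
        }
    }
    out[used] = '\0';
    return used;
}
```

Called as strip_html_comments("one t<!-- a --><!-- b -->wo", buf, sizeof buf),
this leaves "one two" in buf.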

### this is the test input ###

[relson at osage cvs]$ cat ../msg.d/msg.ns.0202.1.txt
From njs at scifi.squawk.com
To: njs at scifi.squawk.com
From: njs at scifi.squawk.com
Date: Thu, 30 Jan 2003 16:13:41 +0530
Mime-Version: 1.0
Content-Type: text/html

html-tags-are-delimiters
one two th<i>r</i>ee fo<!-- foo -->ur f</b>i</b>v</b>e

one-html-comment-is-ok
one t<!-- a bee_comment -->wo th<iiii>r</iiii>ee fo<!-- foo_comment -->ur f</b>i</b>v</b>e

two-html-comments-are-bad
one t<!-- a --><!-- b -->wo th<iiii>r</iiii>ee fo<!-- foo_comment -->ur f</b>i</b>v</b>e


### this is the output ###
[relson at osage cvs]$ bogolexer -p < ../msg.d/msg.ns.0202.1.txt
from
njs
scifi.squawk.com
njs
scifi.squawk.com
from
njs
scifi.squawk.com
mime-version
content-type
text
html
html-tags-are-delimiters
one
two
four
one-html-comment-is-ok
one
two
four
two-html-comments-are-bad
one
two
four

Note that the two-html-comments-are-bad group now yields "two", the same as
the single-comment group, so back-to-back comments no longer split the word.




