html comment processing
David Relson
relson at osagesoftware.com
Sun Mar 30 03:57:50 CEST 2003
Greetings,
Bogofilter has understood html since version 0.10.0 in January and has had
code for discarding html comments. Over time, bogofilter has used two
slightly different definitions to determine what to discard. Initially it
defined comments as "<!--whatever-->", which roughly matches the official
definition. Some spam didn't include the second pair of hyphens. This
caused enough trouble that bogofilter's definition was changed to
"<!--whatever>" (for 0.10.3.1, the previous stable version). For 0.11.0
bogofilter returned to the strict definition. Recently people have been
receiving spam with constructs like "Please vis<! FF3FFi?FS$s0,sz>it our
web<! FF3FFi?FS$s0,sz>si<! FF3FFi?FS$s0,sz>te". Bogofilter handles this
poorly.
The question at hand is "How should bogofilter define html comments?"
W3C defines html comments at
http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5 as:
"To include comments in an HTML document, use a comment declaration. A
comment declaration consists of `<!' followed by zero or more comments
followed by `>'. Each comment starts with `--' and includes all text up to
and including the next occurrence of `--'. In a comment declaration, white
space is allowed after each comment, but not before the first comment. The
entire comment declaration is ignored."
This corresponds quite closely to bogofilter's current (0.11.1.5)
definition and doesn't work well with either the second or the third sample
above. Since bogofilter has to live in the real world, it should process
html comments so as to best recognize the text in the message. The current
practice of spammers is to use the hyphens, but only sometimes. Bogofilter
should be able to process comments whether or not they have the hyphens.
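
As an illustration of the W3C wording quoted above, here is a minimal sketch
of a strict matcher. It is not bogofilter code: the function name
comment_declaration_len is hypothetical, and it assumes a NUL-terminated
string. It returns the length of a comment declaration that starts at the
given position, or 0 if the text there is not one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of bogofilter.
 * Returns the length of the comment declaration starting at s,
 * or 0 if s does not start with one. */
size_t comment_declaration_len(const char *s)
{
    const char *p = s;

    /* A comment declaration opens with "<!". */
    if (p[0] != '<' || p[1] != '!')
        return 0;
    p += 2;

    /* Zero or more comments: "--", text, "--", then optional white space.
     * White space before the first comment is not allowed, so it is
     * only skipped after a comment has been closed. */
    while (p[0] == '-' && p[1] == '-') {
        const char *close = strstr(p + 2, "--");
        if (close == NULL)
            return 0;               /* comment never closed */
        p = close + 2;
        while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
            p++;
    }

    /* The declaration must end with ">". */
    return (*p == '>') ? (size_t)(p + 1 - s) : 0;
}
```

Under this reading, "<!-- hi -->" and the empty declaration "<!>" are both
accepted, while "<!--whatever>" and "<! FF3FFi?FS$s0,sz>" are rejected, which
is why a strict parser leaves those spam constructs in the token stream.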
A quick browser check indicates that the double hyphens are totally
ignored, i.e. "<!whatever>" is treated as a comment. Bogofilter's default
mode should be to duplicate this behavior. For those who want to
experiment, a patch is attached.
Also of note, today there has been a discussion titled "It's getting
worse", which is about spam with html comments lacking hyphens. The
"Please vis<!..>it our ..." sample is from a message in the
discussion. The attached patch fixes that problem as well.
For the html purists, I propose to add a config file option named
"strict_comment". A value of "true" will cause bogofilter to follow the
standard and a value of "false" will work as described above. The default
value will be "false".
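
The effect of the proposed option can be sketched in a few lines of
standalone C. This is a simplified illustration, not the attached patch: the
function strip_comments and its "strict" parameter are hypothetical, it works
on a whole NUL-terminated string rather than on bogofilter's buffer, and it
ignores nesting.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of bogofilter.
 * Removes comments from text in place and returns the new length.
 * strict == true  : a comment is "<!--" ... "-->"
 * strict == false : a comment is "<!"   ... ">"   */
size_t strip_comments(char *text, bool strict)
{
    const char *start  = strict ? "<!--" : "<!";
    const char *finish = strict ? "-->"  : ">";
    size_t start_len  = strlen(start);
    size_t finish_len = strlen(finish);
    char *src = text;   /* next character to examine */
    char *dst = text;   /* next position to write a kept character */

    while (*src != '\0') {
        if (strncmp(src, start, start_len) == 0) {
            char *end = strstr(src + start_len, finish);
            if (end != NULL) {
                /* Skip the whole comment, delimiters included. */
                src = end + finish_len;
                continue;
            }
            /* Unterminated comment: fall through and keep the text. */
        }
        *dst++ = *src++;
    }
    *dst = '\0';
    return (size_t)(dst - text);
}
```

With strict set to false, the sample "Please vis<! FF3FFi?FS$s0,sz>it our
web<! FF3FFi?FS$s0,sz>si<! FF3FFi?FS$s0,sz>te" reduces to "Please visit our
website". With strict set to true the same input is left unchanged, because
none of the "<!" sequences is followed by "--".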
David
-------------- next part --------------
Index: html.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/html.c,v
retrieving revision 1.12
diff -u -r1.12 html.c
--- html.c 10 Mar 2003 05:13:40 -0000 1.12
+++ html.c 30 Mar 2003 01:47:52 -0000
@@ -18,19 +18,18 @@
#include "html.h"
#include "lexer.h"
-/* Macro Definitions */
-
-#define COMMENT_START "<!--"
-#define COMMENT_START_LEN 4 /* strlen(COMMENT_START) */
-
-#define COMMENT_END "-->"
-#define COMMENT_END_LEN 3 /* strlen(COMMENT_END) */
-
/* Function Declarations */
static int kill_html_comment(buff_t *buff, size_t comment_start);
-/* http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5
+bool strict_check = false;
+
+/* If strict_check is enabled, bogofilter will check for "<!--" and "-->".
+** If strict_check is disabled, bogofilter will check for "<!" and ">".
+**
+** The strict mode corresponds to the comment definition at:
+**
+** http://www.w3.org/MarkUp/html-spec/html-spec_3.html#SEC3.2.5
**
** Comments:
**
@@ -79,6 +78,11 @@
bool done = false;
byte *tmp = buf_beg;
+ const char *start = strict_check ? "<!--" : "<!";
+ const char *finish = strict_check ? "-->" : ">";
+ size_t start_len = strlen(start);
+ size_t finish_len = strlen(finish);
+
while (!done) {
byte c;
size_t need;
@@ -94,7 +98,7 @@
*/
c = *tmp;
- need = (c == '<') ? COMMENT_START_LEN : 1;
+ need = (c == '<') ? start_len : finish_len;
buf_used = buf_beg + buff->t.leng - comment_offset;
avail = buf_used - tmp;
@@ -114,24 +118,21 @@
{
/* ensure buffer has sufficient characters for test */
/* check for comment delimiter */
- if (memcmp(tmp, COMMENT_START, COMMENT_START_LEN) != 0)
+ if (memcmp(tmp, start, start_len) != 0)
tmp += 1;
else {
comment = tmp;
level += 1;
- tmp += COMMENT_START_LEN;
+ tmp += start_len;
}
break;
}
case '>':
{
- /* Hack to only check for ">" rather than complete terminator "-->" */
- bool short_check = false;
if (level == 0)
done = true;
tmp += 1;
- if (level > 0 && (short_check ||
- memcmp(tmp - COMMENT_END_LEN, COMMENT_END, COMMENT_END_LEN) == 0))
+ if (level > 0 && (memcmp(tmp - finish_len, finish, finish_len) == 0))
{
/* eat comment */
buff_shift(buff, comment, tmp - comment);
@@ -155,7 +156,7 @@
/* When killing html comments, there's no need to keep it in memory */
if (comment != NULL &&
- buf_end - buf_used < COMMENT_END_LEN)
+ (size_t)(buf_end - buf_used) < finish_len)
{
/* Leave enough to recognize the end of comment string. */
size_t shift = tmp - comment;