[PATCH] consecutive html tags

Nick Simicich njs at scifi.squawk.com
Mon Feb 3 01:51:32 CET 2003


 >Nick,
 >
 >I've looked at the manner in which kill_html_comments handles consecutive
 >comments and confirmed the problem you reported.  I also have a patch
 >which corrects the problem.  The patch, my test message, and the output
 >are below.
 >
 >Please confirm the fix.

This works.  It also works for my modified version, which kills all HTML
tags, simply by changing the strip tags.
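
For anyone following along, here is a minimal stand-alone sketch of the
consecutive-comment case.  It is not the code in html.c; the function name
strip_comments and the fixed "<!--" / "-->" delimiters are assumptions made
for illustration only.  The point it tries to show is that, after a comment
closes, the scanner should re-test the same position for a new comment start
before it steps past any character.

```c
#include <string.h>

/* Hypothetical helper, not bogofilter code: copy src to dst, dropping
 * every <!-- ... --> comment.  dst must hold at least strlen(src) + 1
 * bytes. */
static void strip_comments(const char *src, char *dst)
{
    while (*src != '\0') {
        if (strncmp(src, "<!--", 4) == 0) {
            const char *end = strstr(src + 4, "-->");
            if (end == NULL)
                break;      /* unterminated comment: drop the rest */
            src = end + 3;  /* resume right after "-->" */
            continue;       /* re-test here: another comment may start */
        }
        *dst++ = *src++;    /* ordinary character: copy and advance */
    }
    *dst = '\0';
}
```

Given "one t<!-- a --><!-- b -->wo", this sketch leaves "one two" in dst,
which matches the third case in the test message.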

With the input handled line by line, I do not see how to do the
reordering I suggested.  Maybe the entire section could be read into
memory before the tags are moved to the beginning or end of the
section?  I will look at your patched program and try some guesses.
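
A rough sketch of the in-memory idea, assuming the whole section is already
in one NUL-terminated buffer.  The name move_tags_to_end is hypothetical, and
the sketch naively treats anything between '<' and the next '>' as a tag, so
a comment containing '>' would need extra care.  It is meant only to show the
reordering, not how bogofilter would do it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: return a newly allocated string holding the
 * section's text first, then a space, then every <...> tag in the order
 * seen.  Returns NULL if allocation fails.  The caller frees the result. */
static char *move_tags_to_end(const char *section)
{
    size_t len = strlen(section);
    char *text = malloc(len + 1);
    char *tags = malloc(len + 1);
    char *out = NULL;
    size_t nt = 0, ng = 0;
    int in_tag = 0;

    if (text != NULL && tags != NULL) {
        for (size_t i = 0; i < len; i++) {
            char c = section[i];
            if (c == '<')
                in_tag = 1;         /* tag starts at this character */
            if (in_tag)
                tags[ng++] = c;     /* collect tag bytes separately */
            else
                text[nt++] = c;     /* everything else is text */
            if (c == '>')
                in_tag = 0;         /* tag ends after this character */
        }
        tags[ng] = '\0';
        out = malloc(nt + ng + 2);
        if (out != NULL) {
            memcpy(out, text, nt);
            out[nt] = ' ';          /* delimiter between text and tags */
            memcpy(out + nt + 1, tags, ng + 1);
        }
    }
    free(text);
    free(tags);
    return out;
}
```

With "th<i>r</i>ee" as input, the result is "three <i></i>", so the word
reaches the lexer in one piece and the tags still act as delimiters after it.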


 >Thank you.
 >
 >David
 >
 >### here's the patch ###
 >
 >Index: html.c
 >===================================================================
 >RCS file: /cvsroot/bogofilter/bogofilter/html.c,v
 >retrieving revision 1.9
 >diff -u -r1.9 html.c
 >--- html.c      29 Jan 2003 02:29:01 -0000      1.9
 >+++ html.c      2 Feb 2003 16:42:23 -0000
 >@@ -90,9 +90,11 @@
 >                 level -= 1;
 >             }
 >         }
 >-       if (level == 0)
 >-           break;
 >-       tmp += 1;
 >+       else {
 >+           tmp += 1;
 >+           if (level == 0)
 >+               break;
 >+       }
 >         /* When killing html comments, there's no need to keep it in memory */
 >         if (kill_html_comments && buf_end - buf_used < COMMENT_END_LEN) {
 >             /* Leave enough to recognize the end of comment string. */
 >
 >### this is the test input ###
 >
 >[relson at osage cvs]$ cat ../msg.d/msg.ns.0202.1.txt
 > From njs at scifi.squawk.com
 >To: njs at scifi.squawk.com
 >From: njs at scifi.squawk.com
 >Date: Thu, 30 Jan 2003 16:13:41 +0530
 >Mime-Version: 1.0
 >Content-Type: text/html
 >
 >html-tags-are-delimiters
 >one two th<i>r</i>ee fo<!-- foo -->ur f</b>i</b>v</b>e
 >
 >one-html-comment-is-ok
 >one t<!-- a bee_comment -->wo th<iiii>r</iiii>ee fo<!-- foo_comment -->ur
 >f</b>i</b>v</b>e
 >
 >two-html-comments-are-bad
 >one t<!-- a --><!-- b -->wo th<iiii>r</iiii>ee fo<!-- foo_comment -->ur
 >f</b>i</b>v</b>e
 >
 >
 >### this is the output ###
 >[relson at osage cvs]$ bogolexer -p < ../msg.d/msg.ns.0202.1.txt
 >from
 >njs
 >scifi.squawk.com
 >njs
 >scifi.squawk.com
 >from
 >njs
 >scifi.squawk.com
 >mime-version
 >content-type
 >text
 >html
 >html-tags-are-delimiters
 >one
 >two
 >four
 >one-html-comment-is-ok
 >one
 >two
 >four
 >two-html-comments-are-bad
 >one
 >two
 >four

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


