bogolexer
Nick Simicich
njs at scifi.squawk.com
Sun Feb 2 11:28:36 CET 2003
I was trying to look at the issue of deleting all of the html tags rather
than just the html comments. This would be because someone can do
something like:
P<b></b>ick y<i>o</i>ur s</b>p</b>a</b>m up here. This will probably
format quite nicely in "eyespace" - the only visible change will be that
the "o" in "your" will be italicized, but bogofilter will not get the
tags. I noted a couple of issues.
First, I naively tried testing by using bogolexer and just echoing into
it. It might be useful to have an initial-state option for bogolexer so that
you can force it to treat one-liners as html. I finally prepared a simple
html-only mail.
Bug: bogolexer has two separate help texts that come out at different points --
one comes out if you make a mistake, the other comes out if you type
-h. I thought about doing a patch, but this looks intentional. I typically
try to get help from a program with -?, and I assumed that was all
there was, especially since the man page matches the usage() and not the
help(). IMHO, these should all match: all options should be in the usage,
explained in the help, and mentioned in the man page, unless you want the
man page to tell users to run the program with -h. I honestly did not
find the options hidden in -h until I looked at the source.
The following output seemed to indicate that it is not eliding all of the
html comments. I presume that, for whatever reason, it is not in html
mode here and is handling this as plain text. It is stripping the <!-- and -->
because they are specials, not because they are html comment delimiters.
[njs at parrot bogofilter-0.10.1.5]$ echo '<html> one two three fo<!-- foo
-->ur five </html>' | bogolexer -k y -v -v -v -v -v -xcdfglmrstw
textblock.c:34 0x8058d80 0x8058d90 20 alloc, cur: 20, max: 20, tot: 20
normal mode.
... found.
*** mime_reset
*** mime_push. stackp: 0
*** 1 h,h 51 <html> one two three fo<!-- foo -->ur five </html>
get_token: 1 'html'
get_token: 1 'one'
get_token: 1 'two'
get_token: 1 'three'
get_token: 1 'foo'
get_token: 1 'five'
get_token: 1 'html'
*** 2 h,h 0
7 tokens read.
textblock.c:78 0x8058d80 0x8058d90 free, cur: 8, max: 20, tot: 20
cur: 0, max: 20, tot: 20
[njs at parrot bogofilter-0.10.1.5]$
If I prepare a file that looks like:
[njs at parrot bogofilter-0.10.1.5]$ cat html2.test
From njs at scifi.squawk.com
To: njs at scifi.squawk.com
From: njs at scifi.squawk.com
Date: Thu, 30 Jan 2003 16:13:41 +0530
Mime-Version: 1.0
Content-Type: text/html
one two th<i>r</i>ee fo<!-- foo -->ur f</b>i</b>v</b>e
Then I get this:
[njs at parrot bogofilter-0.10.1.5]$ cat html2.test | bogolexer -k y
normal mode.
get_token: 2 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'mime-version'
get_token: 1 'content-type'
get_token: 1 'text'
get_token: 1 'html'
get_token: 1 'one'
get_token: 1 'two'
get_token: 1 'four'
15 tokens read.
[njs at parrot bogofilter-0.10.1.5]$ cat html2.test | bogolexer -k n
normal mode.
get_token: 2 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'mime-version'
get_token: 1 'content-type'
get_token: 1 'text'
get_token: 1 'html'
get_token: 1 'one'
get_token: 1 'two'
14 tokens read.
[njs at parrot bogofilter-0.10.1.5]$
Is this intentional? (I think it is, because I figured out, unexpectedly and
much later, that stripping the html comments is not done by lex at all, even
though there are patterns for them in the lex file.)
I was trying to see if I could determine what effect the html tag removal
had on the tokenizing. The simple answer should be: An html tag does not
end the token. The only thing that ends a token is whitespace. But the
program is, in fact, ending the token when it encounters an html tag.
Now, I decided to dig around in the program to see what you were actually
doing. It finally turned out that you were stripping comments in a separate
pass in html.c, using the function named (oddly enough :-))
kill_html_comment(). What was not obvious was that this pass runs
before the input is fed to lex. I spent way too much time
in the debugger tracing through the generated code before I finally
realized this.
The crude patch below seems to extend the stripping action to all html
tokens (well, not all...).
It is not clear that this is a good idea... there are some things you might
want to pull out of html tags. Perhaps the most important of those are
host names and IP addresses of ratware servers. But it is a good idea to
pull the html out of the character stream.
[njs at parrot bogofilter-0.10.1.5]$ diff -u html.c.orig html.c
--- html.c.orig Sun Feb 2 03:44:56 2003
+++ html.c Sun Feb 2 03:45:20 2003
@@ -16,11 +16,11 @@
/* Macro Definitions */
-#define COMMENT_START "<!--"
-#define COMMENT_START_LEN 4 /* strlen(COMMENT_START) */
+#define COMMENT_START "<"
+#define COMMENT_START_LEN 1 /* strlen(COMMENT_START) */
-#define COMMENT_END "-->"
-#define COMMENT_END_LEN 3 /* strlen(COMMENT_END) */
+#define COMMENT_END ">"
+#define COMMENT_END_LEN 1 /* strlen(COMMENT_END) */
/* Global Variables */
[njs at parrot bogofilter-0.10.1.5]$
Finally, an unmodified bogolexer fails when you put in two comments
back-to-back, as so:
[root at scifi bogofilter-0.10.1.5]# ./bogolexer -I html3.test
normal mode.
get_token: 2 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'mime-version'
get_token: 1 'content-type'
get_token: 1 'text'
get_token: 1 'html'
get_token: 1 'one'
get_token: 1 'four'
14 tokens read.
[root at scifi bogofilter-0.10.1.5]# cat html3.test
From njs at scifi.squawk.com
To: njs at scifi.squawk.com
From: njs at scifi.squawk.com
Date: Thu, 30 Jan 2003 16:13:41 +0530
Mime-Version: 1.0
Content-Type: text/html
one t<!-- a --><!-- bee_comment -->wo th<iiii>r</iiii>ee fo<!-- foo_comment
-->ur f</b>i</b>v</b>e
[root at scifi bogofilter-0.10.1.5]#
Note that it has reassembled the "four", but not the "two", because the "two"
has two back-to-back comments.
I'm fairly sure it is because of the "tmp += 1" in

for (tmp = buf; tmp < buf_used && (tmp = memchr(tmp, '<', buf_used - tmp)) != NULL; tmp += 1) {

in process_html_comments, but, frankly, I gave up trying to figure out how
you had optimized your loop. I noticed this because when I applied the
above patch, it deleted every other html tag when the tags were back to back.
I am of the opinion that the function that elides the html comments should
be reworked to simply move the comments and the tags to the beginning or
the end of the buffer. Instead of sliding the buffer down and shortening it,
stash the area between the < and the >, and move it to one end of the
buffer or the other. There is still the issue of tags (or tokens, for that
matter) that cross buffer boundaries. I am not sure that your code will
properly deal with html comments that cross boundaries.
In any case, you have one hard bug (the back-to-back comment removal), one
soft bug (the changing of the usage message in bogolexer), one low priority
enhancement request (the request for initial state setting in bogolexer so
that it is easier to test html massaging code with echo) and in my opinion,
one high priority enhancement request that still requires a bit of thought
(the issue of reworking the comment removal code, since you have to open it
up anyway, to move the comments rather than eliding them, and to move the
tags as well, rather than eliding them).
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!