bogolexer
Nick Simicich
njs at scifi.squawk.com
Sun Feb 2 11:28:36 CET 2003
I was trying to look at the issue of deleting all of the html tags rather
than just the html comments. This would be because someone can do
something like:
P<b></b>ick y<i>o</i>ur s</b>p</b>a</b>m up here. This will probably
format quite nicely in "eyespace" - the only visible change will be that
the "o" in "your" will be italicized, but bogofilter will not get the
tags. I noted a couple of issues.
First, I naively tried testing by using bogolexer and just echoing into
it. It might be useful to have an initial-state option for bogolexer so that
you can force it to treat one-liners as html. I finally prepared a simple
html-only mail.
Bug: bogolexer has two separate help texts that come out at different points --
one comes out if you make a mistake, the other comes out if you type
-h. I thought about doing a patch, but this looks intentional. I typically
try to get help from a program with -?, and I assumed that was all
there was, especially since the man page matches the usage() and not the
help(). IMHO, these should all match: all options should be in the usage,
explained in the help, and mentioned in the man page, unless you want the
man page to tell users to run the program with -h. I honestly did not
find the options hidden in -h until I looked at the source.
The following output seemed to indicate that it is not eliding all of the
html comments. I presume that, for whatever reason, it is not in html
mode here and is handling this as plain text. It is stripping the <!-- and -->
because they are specials, not because they are html comment delimiters.
[njs at parrot bogofilter-0.10.1.5]$ echo '<html> one two three fo<!-- foo
-->ur five </html>' | bogolexer -k y -v -v -v -v -v -xcdfglmrstw
textblock.c:34 0x8058d80 0x8058d90 20 alloc, cur: 20, max: 20, tot: 20
normal mode.
... found.
*** mime_reset
*** mime_push. stackp: 0
*** 1 h,h 51 <html> one two three fo<!-- foo -->ur five </html>
get_token: 1 'html'
get_token: 1 'one'
get_token: 1 'two'
get_token: 1 'three'
get_token: 1 'foo'
get_token: 1 'five'
get_token: 1 'html'
*** 2 h,h 0
7 tokens read.
textblock.c:78 0x8058d80 0x8058d90 free, cur: 8, max: 20, tot: 20
cur: 0, max: 20, tot: 20
[njs at parrot bogofilter-0.10.1.5]$
If I prepare a file that looks like:
[njs at parrot bogofilter-0.10.1.5]$ cat html2.test
From njs at scifi.squawk.com
To: njs at scifi.squawk.com
From: njs at scifi.squawk.com
Date: Thu, 30 Jan 2003 16:13:41 +0530
Mime-Version: 1.0
Content-Type: text/html
one two th<i>r</i>ee fo<!-- foo -->ur f</b>i</b>v</b>e
Then I get this:
[njs at parrot bogofilter-0.10.1.5]$ cat html2.test | bogolexer -k y
normal mode.
get_token: 2 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'mime-version'
get_token: 1 'content-type'
get_token: 1 'text'
get_token: 1 'html'
get_token: 1 'one'
get_token: 1 'two'
get_token: 1 'four'
15 tokens read.
[njs at parrot bogofilter-0.10.1.5]$ cat html2.test | bogolexer -k n
normal mode.
get_token: 2 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'mime-version'
get_token: 1 'content-type'
get_token: 1 'text'
get_token: 1 'html'
get_token: 1 'one'
get_token: 1 'two'
14 tokens read.
[njs at parrot bogofilter-0.10.1.5]$
Is this intentional? (I think it is, because I figured out, unexpectedly and
much later, that stripping the html comments is not done by lex at all, even
though there are patterns for them in the lex file.)
I was trying to see if I could determine what effect the html tag removal
had on the tokenizing. The simple answer should be: An html tag does not
end the token. The only thing that ends a token is whitespace. But the
program is, in fact, ending the token when it encounters an html tag.
Now, I decided to dig around in the program to see what you were actually
doing. It finally turned out that you were stripping comments in a separate
pass in html.c, using the function named (oddly enough :-))
kill_html_comment(). What was not obvious was that this pass runs
before the input is fed to lex. I spent way too much time
in the debugger tracing through the generated code before I finally
realized this.
The crude patch below seems to extend the stripping action to all html
tokens (well, not all...).
It is not clear that this is a good idea... there are some things you might
want to pull out of html tags. Perhaps the most important of those are
host names and IP addresses of ratware servers. But it is a good idea to
pull the html out of the character stream.
[njs at parrot bogofilter-0.10.1.5]$ diff -u html.c.orig html.c
--- html.c.orig Sun Feb 2 03:44:56 2003
+++ html.c Sun Feb 2 03:45:20 2003
@@ -16,11 +16,11 @@
/* Macro Definitions */
-#define COMMENT_START "<!--"
-#define COMMENT_START_LEN 4 /* strlen(COMMENT_START) */
+#define COMMENT_START "<"
+#define COMMENT_START_LEN 1 /* strlen(COMMENT_START) */
-#define COMMENT_END "-->"
-#define COMMENT_END_LEN 3 /* strlen(COMMENT_END) */
+#define COMMENT_END ">"
+#define COMMENT_END_LEN 1 /* strlen(COMMENT_END) */
/* Global Variables */
[njs at parrot bogofilter-0.10.1.5]$
Finally, an unmodified bogolexer fails when you put in two comments
back-to-back, as so:
[root at scifi bogofilter-0.10.1.5]# ./bogolexer -I html3.test
normal mode.
get_token: 2 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'from'
get_token: 1 'njs'
get_token: 1 'scifi.squawk.com'
get_token: 1 'mime-version'
get_token: 1 'content-type'
get_token: 1 'text'
get_token: 1 'html'
get_token: 1 'one'
get_token: 1 'four'
14 tokens read.
[root at scifi bogofilter-0.10.1.5]# cat html3.test
From njs at scifi.squawk.com
To: njs at scifi.squawk.com
From: njs at scifi.squawk.com
Date: Thu, 30 Jan 2003 16:13:41 +0530
Mime-Version: 1.0
Content-Type: text/html
one t<!-- a --><!-- bee_comment -->wo th<iiii>r</iiii>ee fo<!-- foo_comment
-->ur f</b>i</b>v</b>e
[root at scifi bogofilter-0.10.1.5]#
Note that it has reassembled the "four", but not the "two", because the "two"
has two back-to-back comments.
I'm fairly sure it is because of the "tmp += 1" in

for (tmp = buf; tmp < buf_used && (tmp = memchr(tmp, '<', buf_used - tmp)) != NULL; tmp += 1) {

in process_html_comments, but, frankly, I gave up trying to figure out how
you had optimized your loop. I noticed this because when I applied the
above patch, it deleted every other html tag when the tags were back to back.
I am of the opinion that the function that elides the html comments should
be reworked to simply move the comments and the tags to the beginning or
the end of the buffer. Instead of sliding the buffer down and shortening it,
stash the area between the < and the >, and move it to one end of the
buffer or the other. There is still the issue of tags (or tokens, for that
matter) that cross buffer boundaries. I am not sure that your code will
properly deal with html comments that cross boundaries.
In any case, you have one hard bug (the back-to-back comment removal), one
soft bug (the changing of the usage message in bogolexer), one low priority
enhancement request (the request for initial state setting in bogolexer so
that it is easier to test html massaging code with echo) and in my opinion,
one high priority enhancement request that still requires a bit of thought
(the issue of reworking the comment removal code, since you have to open it
up anyway, to move the comments rather than eliding them, and to move the
tags as well, rather than eliding them).
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!