bogolexer

David Relson relson at osagesoftware.com
Sun Feb 2 15:37:49 CET 2003


Hi Nick,

An interesting message and some very good observations and thoughts on 
bogolexer.  Initially bogolexer was just used in testing to verify that 
messages were being tokenized properly.  As you discovered, it has become a 
very useful tool for seeing how a message is read and parsed into tokens.

As I stated in an earlier message, the primary goal of the 0.10 release has 
been to include MIME processing, with all the ensuing complexity of 
content-types, content-encodings, and MIME component header and body 
parts.  A secondary goal was to split the lexer grammar into three 
components - one for message headers, one for plain text, and one for html 
text.  We also took out the discarding of html keywords (color, align, 
etc.) and added the killing of html comments.

Naturally all these tasks seemed easier to implement than they really 
were.  Ideally a lexer rule to discard an html comment would splice 
together the text before and after the comment.  However that's not how 
flex operates: the discarded comment is treated as a delimiter, so 
"chara<!--junk-->cter" becomes two tokens.  That made it necessary to kill 
html comments in a preprocessor pass.  Life would have been good if all 
spammers used "<!--" and "-->" to begin and end their comments.  However 
some spam uses ">" as the end.  So the code changes as reality intrudes.
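
To make that concrete, here is a minimal sketch of such a preprocessor 
pass (illustrative only - it is not the actual kill_html_comment() code).  
Because the comment is spliced out before flex ever sees the text, 
"chara<!--junk-->cter" reaches the lexer as the single token "character":

    #include <stdio.h>
    #include <string.h>

    static void elide_html_comments(char *buf)
    {
        char *start;

        while ((start = strstr(buf, "<!--")) != NULL) {
            /* Some spam ends comments with a bare ">", so stop at
             * the first '>' (which also covers a proper "-->"). */
            char *end = strchr(start + 4, '>');
            if (end == NULL) {      /* unterminated: drop the tail */
                *start = '\0';
                break;
            }
            /* Splice the tail (including its NUL) over the comment. */
            memmove(start, end + 1, strlen(end + 1) + 1);
        }
    }

    int main(void)
    {
        char line[] = "chara<!--junk-->cter";

        elide_html_comments(line);
        printf("%s\n", line);       /* prints "character" */
        return 0;
    }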

At present, bogofilter discards the comments, plain and simple.  Since 
comments can contain any sort of random junk (and some do), keeping the 
tokens is counterproductive.

At present, bogofilter also discards the contents of html tags.  That's 
likely to change, though we developers need feedback on what people think 
should be done with them.  Should we discard the standard keywords or keep 
them?  What should we do with URLs?  With color values?  There are many 
things that can be done, and there's the whole future in which to do them.

Moving html tags to the beginning or end of the buffer could be done.
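
A minimal sketch of that idea (hypothetical names, and simplified to copy 
into two buffers rather than shuffle one buffer in place):

    #include <stdio.h>
    #include <string.h>

    /* Separate running text from the "<...>" spans so the
     * surrounding text joins up but the tag material survives. */
    static void split_tags(const char *in, char *text, char *tags,
                           size_t outsz)
    {
        size_t t = 0, g = 0;

        while (*in != '\0') {
            if (*in == '<') {
                const char *end = strchr(in, '>');
                if (end == NULL)
                    break;                 /* unterminated tag */
                size_t len = (size_t)(end - in) + 1;
                if (g + len < outsz) {     /* stash the whole tag */
                    memcpy(tags + g, in, len);
                    g += len;
                }
                in = end + 1;
            } else {
                if (t + 1 < outsz)
                    text[t++] = *in;
                in++;
            }
        }
        text[t] = '\0';
        tags[g] = '\0';
    }

    int main(void)
    {
        char text[128], tags[128];

        split_tags("P<b></b>ick y<i>o</i>ur spam", text, tags,
                   sizeof text);
        printf("%s | %s\n", text, tags);   /* Pick your spam | <b>... */
        return 0;
    }

With something like that, "Pick", "your", and "spam" come through whole, 
and the tag text can still be tokenized - or discarded - separately.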


At 05:28 AM 2/2/03, Nick Simicich wrote:
>I was trying to look at the issue of deleting all of the html tags rather 
>than just the html comments.  This would be because someone can do 
>something like:
>
>P<b></b>ick y<i>o</i>ur s</b>p</b>a</b>m up here.  This will probably 
>format quite nicely in "eyespace" - the only visible change will be that 
>the "o" in "your" will be italicized, but bogofilter will not get the 
>tags.  I noted a couple of issues.
>
>First, I naively tried testing by using bogolexer and just echoing into 
>it.  It might be useful to have an initial-state option for bogolexer so 
>that you can force it to treat one-liners as html.  I finally prepared a 
>simple html-only mail.
>
>Bug:  bogolexer has two separate help texts that come out at different 
>points -- one comes out if you make a mistake, the other if you type -h.  
>I thought about doing a patch, but this looks intentional.  I typically 
>try to get help from a program with -?, and I assumed that was all there 
>was, especially since the man page matches usage() and not help().  IMHO, 
>these should all match - all options should be in the usage, explained in 
>the help, and mentioned in the man page, unless you want to tell users in 
>the man page to use -h on the program.  I honestly did not find the 
>options hidden in -h until I looked at the source.
>
>The following output seemed to indicate that it is not eliding all the 
>html comments.  I presume that, for whatever reason, it is not in html 
>mode here but is handling this as plain text.  It is stripping the <!-- 
>and --> because they are specials, not because they are html comment 
>delimiters.
>
>[njs at parrot bogofilter-0.10.1.5]$ echo '<html> one two three fo<!-- foo 
>-->ur five </html>' | bogolexer -k y -v -v -v -v -v -xcdfglmrstw
>textblock.c:34  0x8058d80 0x8058d90  20 alloc, cur: 20, max: 20, tot: 20
>normal mode.
>  ... found.
>*** mime_reset
>*** mime_push. stackp: 0
>***  1 h,h 51 <html> one two three fo<!-- foo -->ur five </html>
>
>get_token: 1 'html'
>get_token: 1 'one'
>get_token: 1 'two'
>get_token: 1 'three'
>get_token: 1 'foo'
>get_token: 1 'five'
>get_token: 1 'html'
>***  2 h,h 0
>7 tokens read.
>textblock.c:78  0x8058d80 0x8058d90 free, cur: 8, max: 20, tot: 20
>cur: 0, max: 20, tot: 20
>[njs at parrot bogofilter-0.10.1.5]$
>
>If I prepare a file that looks like:
>
>[njs at parrot bogofilter-0.10.1.5]$ cat html2.test
> From njs at scifi.squawk.com
>To: njs at scifi.squawk.com
>From: njs at scifi.squawk.com
>Date: Thu, 30 Jan 2003 16:13:41 +0530
>Mime-Version: 1.0
>Content-Type: text/html
>
>one two th<i>r</i>ee fo<!-- foo -->ur f</b>i</b>v</b>e
>
>Then I get this:
>
>[njs at parrot bogofilter-0.10.1.5]$ cat html2.test | bogolexer -k y
>normal mode.
>get_token: 2 'from'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'from'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'mime-version'
>get_token: 1 'content-type'
>get_token: 1 'text'
>get_token: 1 'html'
>get_token: 1 'one'
>get_token: 1 'two'
>get_token: 1 'four'
>15 tokens read.
>[njs at parrot bogofilter-0.10.1.5]$ cat html2.test | bogolexer -k n
>normal mode.
>get_token: 2 'from'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'from'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'mime-version'
>get_token: 1 'content-type'
>get_token: 1 'text'
>get_token: 1 'html'
>get_token: 1 'one'
>get_token: 1 'two'
>14 tokens read.
>[njs at parrot bogofilter-0.10.1.5]$
>
>Is this intentional?  (I think it is, because, unexpectedly, I figured 
>out much later that it is not a lex rule that strips the html comments, 
>even though there are patterns for them in the lexer.)
>
>I was trying to see if I could determine what effect the html tag removal 
>had on the tokenizing.  The simple answer should be:  An html tag does not 
>end the token.  The only thing that ends a token is whitespace. But the 
>program is, in fact, ending the token when it encounters an html tag.
>
>Now, I decided to dig around in the program to see what you were actually 
>doing.  It finally seemed that you were stripping comments in a separate 
>pass in html.c, using the function named (oddly enough :-)) 
>kill_html_comment().  What was not obvious was that you were doing this 
>before you fed the input to lex.  I spent way too much time in the 
>debugger tracing through the generated code before I finally realized 
>this.
>
>The crude patch below seems to extend the stripping action to all html 
>tokens (well, not all...).
>
>It is not clear that this is a good idea...there are some things you 
>might want to pull out of html tags.  Perhaps the most important of those 
>is host names and IP addresses for ratware servers.  But it is a good 
>idea to pull the html out of the character stream.
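>
>For instance (a hypothetical sketch, not a proposal for your actual 
>parser), pulling the host out of an href before the tag is thrown away 
>might look like:
>
>  #include <stdio.h>
>  #include <string.h>
>
>  /* Copy the host part of the first href="http://host/..." in tag
>   * into out.  Hypothetical helper; real tags need more careful
>   * parsing (quoting, case, and so on). */
>  static int href_host(const char *tag, char *out, size_t outsz)
>  {
>      const char *p = strstr(tag, "href=\"http://");
>
>      if (p == NULL)
>          return 0;
>      p += strlen("href=\"http://");
>      size_t n = strcspn(p, "/\"");      /* host ends at '/' or '"' */
>      if (n + 1 > outsz)
>          return 0;
>      memcpy(out, p, n);
>      out[n] = '\0';
>      return 1;
>  }
>
>  int main(void)
>  {
>      char host[64];
>
>      if (href_host("<a href=\"http://ratware.example.com/x\">",
>                    host, sizeof host))
>          printf("%s\n", host);          /* ratware.example.com */
>      return 0;
>  }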
>
>[njs at parrot bogofilter-0.10.1.5]$ diff -u html.c.orig html.c
>--- html.c.orig Sun Feb  2 03:44:56 2003
>+++ html.c      Sun Feb  2 03:45:20 2003
>@@ -16,11 +16,11 @@
>
>  /* Macro Definitions */
>
>-#define        COMMENT_START   "<!--"
>-#define        COMMENT_START_LEN 4             /* strlen(COMMENT_START) */
>+#define        COMMENT_START   "<"
>+#define        COMMENT_START_LEN 1             /* strlen(COMMENT_START) */
>
>-#define        COMMENT_END     "-->"
>-#define        COMMENT_END_LEN 3               /* strlen(COMMENT_END) */
>+#define        COMMENT_END     ">"
>+#define        COMMENT_END_LEN 1               /* strlen(COMMENT_END) */
>
>  /* Global Variables */
>
>[njs at parrot bogofilter-0.10.1.5]$
>
>Finally, an unmodified bogolexer fails when you put in two comments 
>back-to-back, like so:
>
>[root at scifi bogofilter-0.10.1.5]# ./bogolexer -I html3.test
>normal mode.
>get_token: 2 'from'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'from'
>get_token: 1 'njs'
>get_token: 1 'scifi.squawk.com'
>get_token: 1 'mime-version'
>get_token: 1 'content-type'
>get_token: 1 'text'
>get_token: 1 'html'
>get_token: 1 'one'
>get_token: 1 'four'
>14 tokens read.
>[root at scifi bogofilter-0.10.1.5]# cat html3.test
> From njs at scifi.squawk.com
>To: njs at scifi.squawk.com
>From: njs at scifi.squawk.com
>Date: Thu, 30 Jan 2003 16:13:41 +0530
>Mime-Version: 1.0
>Content-Type: text/html
>
>one t<!-- a --><!-- bee_comment -->wo th<iiii>r</iiii>ee fo<!-- 
>foo_comment -->ur f</b>i</b>v</b>e
>[root at scifi bogofilter-0.10.1.5]#
>
>Note that it has reassembled "four", but not "two", because the "two" is 
>split by two back-to-back comments.
>
>I'm fairly sure it is because of the "tmp += 1" in
>
>  for (tmp = buf; tmp < buf_used &&
>       (tmp = memchr(tmp, '<', buf_used - tmp)) != NULL; tmp += 1) {
>
>in process_html_comments, but, frankly, I gave up trying to figure out how 
>you had optimized your loop.  I noticed this because when I applied the 
>above patch, it deleted every other html tag when the tags were back to back.
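>
>To illustrate (a self-contained sketch with made-up names, not your 
>actual code): after a comment is cut out, the buffer slides left, so the 
>next comment's '<' can land exactly where tmp points, and the 
>unconditional "tmp += 1" then steps over it.  Advancing only when nothing 
>was removed handles the back-to-back case:
>
>  #include <stdio.h>
>  #include <string.h>
>
>  static void strip_comments(char *buf)
>  {
>      char *tmp = buf;
>
>      while ((tmp = strchr(tmp, '<')) != NULL) {
>          if (strncmp(tmp, "<!--", 4) == 0) {
>              char *end = strstr(tmp + 4, "-->");
>              if (end == NULL) {         /* unterminated: drop tail */
>                  *tmp = '\0';
>                  break;
>              }
>              /* Slide the tail (and its NUL) over the comment; do
>               * not advance tmp - the next '<' may now sit here. */
>              memmove(tmp, end + 3, strlen(end + 3) + 1);
>          } else {
>              tmp += 1;                  /* ordinary '<': step over */
>          }
>      }
>  }
>
>  int main(void)
>  {
>      char line[] = "one t<!-- a --><!-- bee -->wo";
>
>      strip_comments(line);
>      printf("%s\n", line);              /* prints "one two" */
>      return 0;
>  }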
>
>I am of the opinion that the function that elides the html comments 
>should be reworked to simply move the comments and the tags to the 
>beginning or the end of the buffer.  Instead of sliding the buffer in and 
>shortening it, stash the area between the < and the > and move it to one 
>end of the buffer or the other.  There is still the issue of tags (or 
>tokens, for that matter) that cross buffer boundaries.  I am not sure 
>that your code will properly deal with html comments that cross those 
>boundaries.
>
>In any case, you have one hard bug (the back-to-back comment removal), 
>one soft bug (the mismatch among bogolexer's usage message, help text, 
>and man page), one low-priority enhancement request (an initial-state 
>setting for bogolexer, so that it is easier to test html-massaging code 
>with echo) and, in my opinion, one high-priority enhancement request that 
>still requires a bit of thought (reworking the comment-removal code - 
>since you have to open it up anyway - to move the comments, and the tags 
>as well, rather than eliding them).
>