HTML parsing

David Relson relson at osagesoftware.com
Wed Nov 26 13:58:38 CET 2003


On Wed, 26 Nov 2003 13:29:23 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> David Relson wrote:
> 
> >> As discussed, that might cause HTML parsing even where not
> >> applicable. It does not seem to hurt. How about always doing
> >> HTML parsing? After all, what can happen? If there are
> >> things which look like HTML tags they will be treated as
> >> such, but what else?
> > 
> > <!--The innards of HTML comments are ignored.  So this response
> > doesn't exist.   (except for the names) -->
> > 
> > <!-- Likely there are other issues as well ... -->
> 
> You are perfectly right that those things can happen (as
> with the DOCTYPE switch, if I just explain which one to use
> for a page). But are they likely? I don't know, I don't
> expect it. I'd test it, but my knowledge is not enough to
> modify the lexer to achieve this.
> 
> I am wondering if this change would make the lexer simpler,
> maybe faster or smaller.
> 
> pi

pi,

< given the left angle bracket in this line, an html parser would think
it's an html tag.  Since bogofilter ignores the innards of invalid html
tags, this is another non-message.
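
To make the two examples above concrete, here is a minimal Python sketch
of a naive always-on HTML pass.  It is not bogofilter's lexer: the names
strip_html_like and tokens are made up, and the rule that an
unterminated "<" swallows the rest of the input is an assumption chosen
for illustration.

```python
import re

# Hypothetical, simplified stand-in for an always-on HTML pass.
# Assumptions (not taken from bogofilter's lexer_v3.l):
#   - a comment runs from "<!--" to "-->", or to end of input if unterminated
#   - anything from "<" to the next ">" (or end of input) counts as a tag
COMMENT = re.compile(r"<!--.*?(?:-->|\Z)", re.DOTALL)
TAG = re.compile(r"<[^>]*(?:>|\Z)")


def strip_html_like(text):
    """Return the text that would still reach the tokenizer."""
    text = COMMENT.sub(" ", text)   # drop comment innards first
    return TAG.sub(" ", text)       # then drop anything tag-shaped


def tokens(text):
    """Split the surviving text on whitespace."""
    return strip_html_like(text).split()
```

With these assumptions, tokens("if a < b then") yields only "if" and
"a", and a paragraph that starts with "<" and never closes it yields no
tokens at all.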

The lexer size would decrease by a small amount.  The DOCTYPE rule would
go away, but most everything else would still be needed.  I tried the
experiment with the current lexer (lexer_v3.l.1.125).  Here are the
numbers:

-rw-r--r--    1 relson   relson      11881 Nov 17 09:17 lexer_v3.l.1.125
-rw-r--r--    1 relson   relson      11829 Nov 26 07:47 lexer_v3.l.1.exp
-rw-rw-r--    1 relson   relson     103460 Nov 17 09:21 lexer_v3.c.1.125
-rw-r--r--    1 relson   relson     102634 Nov 26 07:49 lexer_v3.c.1.exp

   text	   data	    bss	    dec	    hex	filename
  43169	      8	     60	  43237	   a8e5	lexer_v3.o.1.125
  42635	      8	     60	  42703	   a6cf	lexer_v3.o.1.exp

 186269	  22968	  18808	 228045	  37acd	bogofilter-1.125
 185741	  22968	  18808	 227517	  378bd	bogofilter-1.exp

The change of roughly 500 bytes (about 0.2% of the bogofilter binary)
seems unimportant.
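
The arithmetic behind that judgment, as a small Python sketch.  The
numbers are copied from the "dec" column of the size output above; the
helper name savings is made up for this example.

```python
# Total sizes in bytes ("dec" column of the size output above).
SIZES = {
    "lexer_v3.o": {"1.125": 43237, "exp": 42703},
    "bogofilter": {"1.125": 228045, "exp": 227517},
}


def savings(name):
    """Return (bytes saved, percent of the original total) for one file."""
    before = SIZES[name]["1.125"]
    after = SIZES[name]["exp"]
    saved = before - after
    return saved, 100.0 * saved / before


for name in SIZES:
    saved, percent = savings(name)
    print(f"{name}: {saved} bytes saved ({percent:.2f}%)")
```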

As to speed, the lexer is a state machine that uses tables to move from
state to state.  Adding a new rule typically has no effect on its speed,
so the simplification you suggest would not have a noticeable effect
either.
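
A toy illustration of why the rule count does not matter, in Python.
This is only a sketch of the general table-driven idea, not how flex
actually lays out its tables (a generated scanner uses compressed
tables and handles longest-match rules); build_tables, scan and the
rule names are all hypothetical.

```python
DEAD = -1  # sink state: no rule can match any more


def build_tables(rules):
    """Compile literal rules [(text, name), ...] into DFA tables (a trie)."""
    transitions = [{}]   # transitions[state][char] -> next state
    accepting = {}       # accepting[state] -> rule name
    for literal, name in rules:
        state = 0
        for ch in literal:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting.setdefault(state, name)  # first rule listed wins
    return transitions, accepting


def scan(tables, text):
    """Run the DFA over text; return (matched rule or None, steps taken)."""
    transitions, accepting = tables
    state, steps = 0, 0
    for ch in text:
        steps += 1  # exactly one state transition per input character
        state = DEAD if state == DEAD else transitions[state].get(ch, DEAD)
    return accepting.get(state), steps


small = build_tables([("<!--", "COMMENT")])
big = build_tables([("<!--", "COMMENT"), ("<!DOCTYPE", "DOCTYPE"),
                    ("<br>", "TAG")])
```

Scanning the same input with small and with big takes the same number
of steps, one per character.  Extra rules make the tables larger, not
the loop longer.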

David

P.S.  There's no need to CC me on messages.  It forces me to delete the
duplicate copy.



