HTML parsing
David Relson
relson at osagesoftware.com
Wed Nov 26 13:58:38 CET 2003
On Wed, 26 Nov 2003 13:29:23 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
> David Relson wrote:
>
> >> As discussed, that might cause HTML parsing even where not
> >> applicable. It does not seem to hurt. How about always doing
> >> HTML parsing? After all, what can happen? If there are
> >> things which look like HTML tags they will be treated as
> >> such, but what else?
> >
> > <!--The innards of HTML comments are ignored. So this response
> > doesn't exist. (except for the names) -->
> >
> > <!-- Likely there are other issues as well ... -->
>
> You are perfectly right that those things can happen (as
> with the DOCTYPE switch, if I just explain which one to use
> for a page). But are they likely? I don't know, I don't
> expect it. I'd test it, but my knowledge is not enough to
> modify the lexer to achieve this.
>
> I am wondering if this change would make the lexer simpler,
> maybe faster or smaller.
>
> pi
pi,
< given the left angle bracket in this line, an HTML parser would think
it's an HTML tag. Since bogofilter ignores the innards of invalid HTML
tags, this is another non-message.
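
To make the effect concrete, here is a rough sketch of what "always parse as HTML" could do to plain text. It is not bogofilter's lexer: the function names and the two regular expressions are made up for illustration, and it assumes a tag runs from '<' to the next '>' (the real lexer may treat an unterminated '<' differently).

```python
import re

# Hypothetical stand-in for "always treat the input as HTML".
# It only mimics the two behaviours described above:
#   1. the body of an HTML comment is dropped
#   2. anything that looks like a tag is dropped up to the next '>'
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]*>")

def strip_html_like(text):
    """Remove comment bodies first, then tag-like spans."""
    without_comments = COMMENT.sub(" ", text)
    return TAG.sub(" ", without_comments)

def tokens(text):
    """Split whatever survives into simple word tokens."""
    return re.findall(r"[A-Za-z0-9']+", strip_html_like(text))

plain = "<!-- The innards are ignored. --> if a < b then print b > a"
print(tokens(plain))
```

Under these assumptions only "if", "a" and "a" survive: the comment vanishes, and so does everything between the stray '<' and the next '>'.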
The lexer size would decrease by a small amount. The DOCTYPE rule would
go away, but most everything else would still be needed. I tried the
experiment with the current lexer (lexer_v3.l.1.125). Here are the
numbers:
-rw-r--r-- 1 relson relson  11881 Nov 17 09:17 lexer_v3.l.1.125
-rw-r--r-- 1 relson relson  11829 Nov 26 07:47 lexer_v3.l.1.exp
-rw-rw-r-- 1 relson relson 103460 Nov 17 09:21 lexer_v3.c.1.125
-rw-r--r-- 1 relson relson 102634 Nov 26 07:49 lexer_v3.c.1.exp

   text    data     bss     dec     hex filename
  43169       8      60   43237    a8e5 lexer_v3.o.1.125
  42635       8      60   42703    a6cf lexer_v3.o.1.exp
 186269   22968   18808  228045   37acd bogofilter-1.125
 185741   22968   18808  227517   378bd bogofilter-1.exp
The change of roughly 500 bytes seems unimportant.
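
The exact differences can be worked out from the numbers above. The snippet below is only a convenience calculation; the dictionary layout and names are mine, while the byte counts are copied from the listings.

```python
# Sizes in bytes, copied from the listings above: (1.125, experimental).
sizes = {
    "lexer_v3.l": (11881, 11829),
    "lexer_v3.c": (103460, 102634),
    "lexer_v3.o (text)": (43169, 42635),
    "bogofilter (text)": (186269, 185741),
}

# Bytes saved by the experimental version, per file.
savings = {name: old - new for name, (old, new) in sizes.items()}

for name, saved in savings.items():
    old = sizes[name][0]
    print(f"{name:20s} {saved:5d} bytes smaller ({100 * saved / old:.2f}%)")
```

The object file shrinks by 534 bytes and the bogofilter executable by 528 bytes of text; data and bss are unchanged.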
As to speed, the lexer is a state machine that uses tables to move from
state to state. Adding a new rule typically has no effect on its speed.
The simplification you suggest would not have a noticeable effect on
speed.
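
A toy example may help show why. The scanner below is a minimal sketch of a table-driven state machine, not the tables flex actually generates; the states, character classes and token rules are invented for the example. The point is that each input character costs one table lookup, so more rules mean a bigger table rather than more work per character.

```python
# Character classes (table columns) and states (table rows).
LETTER, DIGIT, OTHER = 0, 1, 2
START, WORD, NUMBER = 0, 1, 2

# TRANSITIONS[state][char_class] gives the next state.
TRANSITIONS = [
    [WORD, NUMBER, START],  # START: begin a word or a number
    [WORD, WORD, START],    # WORD: letters and digits continue the word
    [WORD, NUMBER, START],  # NUMBER: a letter ends the number, starts a word
]

def char_class(ch):
    if ch.isalpha():
        return LETTER
    if ch.isdigit():
        return DIGIT
    return OTHER

def scan(text):
    """Return (tokens, lookups); exactly one table lookup per character."""
    state, start, out, lookups = START, 0, [], 0
    for i, ch in enumerate(text):
        nxt = TRANSITIONS[state][char_class(ch)]
        lookups += 1
        if nxt != state:
            if state != START:
                out.append(text[start:i])  # the token that just ended
            start = i
        state = nxt
    if state != START:
        out.append(text[start:])  # token still open at end of input
    return out, lookups

found, lookups = scan("abc 123 x9")
print(found, lookups)
```

For the 10-character input the scanner does 10 lookups. Adding another state or character class would add rows or columns to TRANSITIONS but leave the loop unchanged.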
David
P.S. There's no need to CC me on messages. It forces me to delete the
duplicate copy.