[patch] small lexer changes

Mark M. Hoffman mhoffman at lightlink.com
Tue Oct 8 03:27:45 CEST 2002


* Gyepi SAM <gyepi at praxis-sw.com> [2002-10-07 17:25:56 -0400]:
> On Mon, Oct 07, 2002 at 04:46:22PM -0400, David Relson wrote:
> > At 04:24 PM 10/7/02, Matthias Andree wrote:
> > >I'm just wondering if we could reduce the lexer.c code size without
> > >sacrificing too much speed. lexer.c is >46,000 lines here, which pretty
> > >much stinks.
> > 
> > lexer.c is long because of the list of html tag words that are
> > recognized and discarded.  Taking them out makes it much, much
> > shorter and is something I've done while testing its parsing.
> > However, that would likely cost time during analysis - all those
> > words to look up in the word lists.
> 
> We could also move the html words from lexer.c into the ignore list,
> once it is implemented.  That may even be better, since people who
> don't want to exclude html keywords can simply delete them from the
> ignore list.
> 
> Presuming that lexer.c uses a linear search, putting them on the
> ignore list *may* be faster.
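
For context, the html rules in lexer.l look something like this (a
minimal sketch with made-up token names, not the actual bogofilter
rules):

    %%
    "<html>"   |
    "<head>"   |
    "<body>"   |
    "</html>"  { /* recognized and discarded - no token returned */ }

    [a-zA-Z]+  { return TOKEN_WORD;  /* ordinary word, handed on */ }
    %%

Flex expands that whole list of literal patterns into the tables in
the generated lexer.c, which is largely where the 46,000 lines come
from.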

Flex builds a table-driven state machine.  Here's how to keep it fast:

http://www.gnu.org/manual/flex-2.5.4/html_chapter/flex_18.html#SEC18

If you don't feel like reading that, at least read this:

> The speed of the scanner is independent of the number of rules or
> (modulo the considerations given at the beginning of this section)
> how complicated the rules are with regard to operators such as '*'
> and '|'.
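
In other words, the generated scanner is a tight loop over transition
tables: adding or removing rules changes the size of the tables, not
the amount of work done per input character.  Schematically (the names
are illustrative; real flex output uses compressed tables):

    /* One table lookup per input character, no matter how many
     * rules were compiled in. */
    int state = START_STATE;
    int c;
    while ((c = getchar()) != EOF)
        state = next_state[state][c];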

Therefore, moving those html tokens from lexer.l to an ignore list
*cannot* be faster, even though the resulting scanner (lexer.c) will
be smaller.
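
If anything, an ignore list adds work: the scanner must still
recognize each html word as a token, and then every token pays an
extra lookup.  A sketch, assuming the list ends up in some kind of
hash table (the names here are hypothetical; no data structure has
been decided):

    /* Every scanned token now costs a lookup on top of the scanning
     * work, which is already constant per character. */
    if (wordlist_contains(ignore_list, token))
        continue;  /* discard, as the lexer rule used to do for free */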

If you want to move html words to the ignore list for other reasons
(e.g., so that it's user-customizable), that's fine with me.  I'm just
saying that scanner speed alone is a non-reason.

Regards,

-- 
Mark M. Hoffman
mhoffman at lightlink.com


