[patch] small lexer changes

Mon Oct 7 22:46:22 CEST 2002

At 04:24 PM 10/7/02, Matthias Andree wrote:
>On Mon, 07 Oct 2002, Mark M. Hoffman wrote:
>
> > 1. Tighten specification for IP addresses.
>
>If you want it really tight, how about something along these lines:
>
>INT8: ([01]?[0-9]?[0-9]|2([0-4][0-9]|5[0-5]))
>IPADDR: {INT8}\.{INT8}\.{INT8}\.{INT8}
>
> > 2. Ignore ESMTP ids (in addition to SMTP ids)
> > 3. Fix lexertest interface to lexer bug, which matters not now
> >    but will later.
>
>These are fine with me, I hacked this up pretty quick and I'm not really
>aware of [f]lex interfaces.
>
>I'm just wondering if we could reduce the lexer.c code size without
>sacrificing too much speed. lexer.c is >46,000 lines here, which pretty
>much stinks.

lexer.c is long because of the list of HTML tag words that are recognized
and discarded.  Taking them out makes it much shorter, which is something
I've done while testing its parsing.  However, that would likely cost time
during analysis, since all those words would then have to be looked up in
the word lists.

One can always run the speed experiment.  I'd recommend two samples: HTML
and non-HTML.
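
For anyone who wants to sanity-check the INT8/IPADDR pattern quoted above
without rebuilding the lexer, here is a rough sketch in Python.  It is only
an approximation: the flex {INT8} macro is expanded by hand, the helper name
is_ipaddr is made up for illustration, and fullmatch() is an assumption
about token boundaries (flex itself takes the longest match at the current
position, so the real behaviour depends on the surrounding rules).

```python
import re

# Hand expansion of the quoted flex definition:
#   INT8: ([01]?[0-9]?[0-9]|2([0-4][0-9]|5[0-5]))
INT8 = r"(?:[01]?[0-9]?[0-9]|2(?:[0-4][0-9]|5[0-5]))"

# IPADDR: {INT8}\.{INT8}\.{INT8}\.{INT8}
IPADDR = re.compile(rf"{INT8}\.{INT8}\.{INT8}\.{INT8}")


def is_ipaddr(text):
    """Return True if the whole string matches the IPADDR pattern.

    Hypothetical helper, not part of lexer.l; it requires the pattern
    to cover the entire string, which flex does not do by itself.
    """
    return IPADDR.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("192.168.0.1", "255.255.255.255", "256.1.1.1", "1.2.3"):
        print(candidate, is_ipaddr(candidate))
```

Note that the pattern accepts leading zeros (for example 001.002.003.004),
which may or may not be wanted in the lexer.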