unified lexer

Tue Feb 25 03:16:21 CET 2003

Nick & Matthias,

Attached is a tarball containing two files:

new_lexer.l - a first draft of a unified lexer using states (start 
conditions) to identify rules for HEAD, HTML, and TEXT (plain).

new.lexer.patch.0224.txt - a patch (relative to CVS) that replaces the three 
lexers with the unified lexer.
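
For anyone who hasn't used start conditions: the idea is that one scanner carries a current state, and the state decides which rules apply. The sketch below shows that idea in plain C rather than flex, and it is not the code in new_lexer.l. The names (scanner, scanner_init, next_token, MODE_*) are hypothetical, and treating a header token "html" as the signal to scan the body as HTML is a crude stand-in for a real Content-Type check.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical modes standing in for the HEAD, HTML and TEXT states. */
typedef enum { MODE_HEAD, MODE_HTML, MODE_TEXT } lexer_mode;

typedef struct {
    const char *p;        /* read position in the single input buffer */
    lexer_mode mode;      /* plays the role of the start condition */
    lexer_mode body_mode; /* mode to enter once the header ends */
} scanner;

static void scanner_init(scanner *s, const char *input)
{
    s->p = input;
    s->mode = MODE_HEAD;
    s->body_mode = MODE_TEXT;
}

/* Copy the next alphanumeric token into out.
 * Returns 1 if a token was produced, 0 at end of input. */
static int next_token(scanner *s, char *out, size_t outsz)
{
    if (outsz == 0)
        return 0;
    for (;;) {
        unsigned char c = (unsigned char)*s->p;
        if (c == '\0')
            return 0;
        /* HEAD rule: a blank line ends the header. */
        if (s->mode == MODE_HEAD && c == '\n' && s->p[1] == '\n') {
            s->mode = s->body_mode;
            s->p += 2;
            continue;
        }
        /* HTML rule: markup between '<' and '>' yields no tokens. */
        if (s->mode == MODE_HTML && c == '<') {
            while (*s->p != '\0' && *s->p != '>')
                s->p++;
            if (*s->p == '>')
                s->p++;
            continue;
        }
        /* Rule shared by all modes: a run of alphanumerics is a token. */
        if (isalnum(c)) {
            size_t n = 0;
            while (isalnum((unsigned char)*s->p)) {
                if (n + 1 < outsz)   /* truncate overly long tokens */
                    out[n++] = *s->p;
                s->p++;
            }
            out[n] = '\0';
            /* Assumption: "html" in the header means an HTML body. */
            if (s->mode == MODE_HEAD && strcmp(out, "html") == 0)
                s->body_mode = MODE_HTML;
            return 1;
        }
        s->p++; /* skip separators and punctuation */
    }
}
```

Because there is only one scanner and one read position, switching between header, HTML and plain-text handling is just an assignment to mode, with no second or third buffer to keep in step.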

The code works (mostly), though it's a bit rough around the edges: it may 
return an extra token or two, or it may miss one or two.  It doesn't yet 
include Nick's html_tokenize changes (tomorrow, perhaps), and it still 
contains some old TOKEN patterns.

As I said, not quite ready for prime time - but it's close.

Nick - you might try your batch mode performance tests on this new 
code.  With everything combined in one lexer, there's no longer any need to 
juggle multiple lexer buffers.  May you have a productive night and produce 
a breakthrough in performance :-)

Matthias - bang away at it.  If Nick's breakthrough doesn't happen, I'll 
commit my check_alphanum() tomorrow so we'll have some speed (even if we 
don't have beauty and elegance).

Now, 'tis time for me to take a break and hang out with the family.

David