unified lexer
David Relson
relson at osagesoftware.com
Tue Feb 25 03:16:21 CET 2003
Nick & Matthias,
Attached is a tarball containing two files:

  new_lexer.l - a first draft of a unified lexer using states (start
  conditions) to identify rules for HEAD, HTML, and TEXT (plain).

  new.lexer.patch.0224.txt - a patch (relative to cvs) that replaces the 3
  lexers with the unified lexer.

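
For anyone not familiar with start conditions, here is a rough sketch in
plain C of the idea: one tokenizer whose active rule set depends on a
state variable. This is NOT the code in new_lexer.l. The names lex_state,
lexer, and next_token are made up for illustration, and the rules (a blank
line ends the headers; tags are dropped only in the HTML state) are
simplifying assumptions.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states, standing in for flex start conditions. */
typedef enum { S_HEAD, S_HTML, S_TEXT } lex_state;

typedef struct {
    const char *p;         /* current read position in the input */
    lex_state state;       /* which rule set is active right now */
    lex_state body_state;  /* state to enter once the headers end */
} lexer;

/*
 * Copy the next alphanumeric token into buf (size must be at least 1).
 * Returns 1 if a token was found, 0 at end of input.
 */
static int next_token(lexer *lx, char *buf, size_t size)
{
    size_t n = 0;

    /* Skip separators, applying the state-specific rules on the way. */
    while (*lx->p != '\0' && !isalnum((unsigned char)*lx->p)) {
        if (lx->state == S_HEAD && lx->p[0] == '\n' && lx->p[1] == '\n') {
            /* HEAD-only rule: a blank line switches to the body rules. */
            lx->state = lx->body_state;
        } else if (lx->state == S_HTML && *lx->p == '<') {
            /* HTML-only rule: discard everything up to the closing '>'. */
            while (*lx->p != '\0' && *lx->p != '>')
                lx->p++;
            if (*lx->p == '\0')
                break;      /* unterminated tag: stop at end of input */
        }
        lx->p++;
    }

    if (*lx->p == '\0')
        return 0;

    /* Shared rule: a token is a run of alphanumeric characters. */
    while (isalnum((unsigned char)*lx->p)) {
        if (n + 1 < size)   /* truncate tokens that would overflow buf */
            buf[n++] = *lx->p;
        lx->p++;
    }
    buf[n] = '\0';
    return 1;
}
```

Because the state lives in one lexer object, the caller never has to swap
input buffers between separate lexers; it only sets body_state to S_HTML or
S_TEXT before the headers have been consumed.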
The code works (mostly), though it's a bit rough around the edges, i.e. it
may return an extra token or two, or may miss one or two. It doesn't yet
include Nick's html_tokenize changes (tomorrow, perhaps), and it still
contains some old TOKEN patterns.
As I said, not quite ready for prime time - but it's close.
Nick - you might try your batch mode performance tests on this new
code. With the functions combined in one place, there's no longer any need
to juggle multiple lexer buffers. May you have a productive night and
produce a breakthrough in performance :-)
Matthias - bang away at it. If Nick's breakthrough doesn't happen, I'll
commit my check_alphanum() tomorrow so we'll have some speed (even if we
don't have beauty and elegance).
Now, 'tis time for me to take a break and hang out with the family.
David