lexer history and thoughts
David Relson
relson at osagesoftware.com
Sun Feb 23 05:37:21 CET 2003
Greetings,
I've been reading flex documentation and thinking about bogofilter's
lexer. Here're some information and thoughts.
Initially, bogofilter had a single lexer, named lexer.l. It handled
message headers, separators ( "^From "), mime boundary strings, tokenizing,
discarding of html tags, etc. It was an all in one piece of
code. Unfortunately there were problems like size and applying rules at
the wrong time. For example, html tag words like "body" would be discarded
whenever encountered - even in headers or in plain text.
Version 2 used 3 lexers. lexer_head.l knows about header fields like
Subject, Content-Type, Mime-Type, etc. lexer_text_plain.l is for plain
text message bodies and knew about ip addresses, tokens,
etc. lexer_text_html.l is for processing html and knows about html tags
and comments and what to do with them. Unfortunately, all three lexers
need to know about message separators since splitting the input stream into
messages is needed for registering mailboxes.
Version 2 has another problem - speed (or the lack there of) for messages
with OLT's (overly long tokens). Flex has the ability to build a "batch"
lexer which promises significantly speed. However a batch lexer does some
look ahead which conflicts with bogofilter's 3 lexer design. The conflict
causes loss of the first line of a body section.
Version 3 of the lexer exists as some ideas of how to solve the current
problems of controlling overall parsing flow and of code duplication. The
basic idea is to have a master parser, perhaps known as the "driver", that
would handle message headers and mime boundaries and would invoke the plain
text or html text parser at the right time. The difficult part is how to
pass text (with mime decoding, if needed) from the driver to the
others. So far nobody knows quite how to do this, so version 3 is an idea,
not an implementation.
Reviewing why bogofilter has 3 lexers we realize it was a solution to a
problem - the problem of how to implement distinct lexer states for
headers, plain text, and html. As I've been reading the flex
documentation, I've been learning and thinking about "start conditions"
i.e. the flex's mechanism for conditionally activating rules. Flex has
special macros BEGIN and YY_STATE for setting and testing the state and
allows patterns to be prefixed with a start condition. The pattern is
active only when one of its start conditions is the current active
state. Looking at Nick's code for processing html tokens and comments I
see that the code is already using start conditions.
Using 3 major start conditions, perhaps called HEAD, TEXT, and HTML (to
denote message headers, plain text, and html text), a single lexer could
have all the patterns that bogofilter needs. With proper use of the start
conditions, rules would only be active when appropriate. (For those who
don't know, a start condition is included at the beginning of a rule and is
enclosed in angle brackets. For example <HEAD> would precede a rule that
only applies in the message header). As a further example, the rule for an
html comment would be preceded by <HTML>.
A combination lexer with start conditions solves several other problems as
well. As a single lexer needs only one input source, for example stdin,
there's no problem passing text between lexers. This also means there is
no need for the header lexer to pass control to a body lexer so batch mode
becomes usable and bogofilter will no longer lose a line of text.
Unless, I am mistaken, we should change the plan for version 3 from a 3
lexer, master/slave design to a 1 lexer start condition design.
Now a couple of developer questions:
Nick: You've been working on the lexer and investigating batch mode and
the problems attendant to it. What do you think?
Matthias: You had the insight that bogofilter's lexer needed to be
partitioned because of the problems with the single lexer. Having
partitioned the lexing process as 3 distinct lexers we have learned some of
the pitfalls of that approach. What do you think of the idea of using
start conditions to provide the needed functional partition?
Me? I think it's an approach that has promise. At the very least it's
worth implementing a prototype to see if the approach helps and to learn
about its drawbacks.
... and that's what I've been reading and thinking about today ...
G'night.
David
More information about the bogofilter-dev
mailing list