lexer history and thoughts

Sun Feb 23 05:37:21 CET 2003

Greetings,

I've been reading flex documentation and thinking about bogofilter's 
lexer.  Here're some information and thoughts.

Initially, bogofilter had a single lexer, named lexer.l.  It handled 
message headers, separators ( "^From "), mime boundary strings, tokenizing, 
discarding of html tags, etc.  It was an all in one piece of 
code.  Unfortunately there were problems like size and applying rules at 
the wrong time.  For example, html tag words like "body" would be discarded 
whenever encountered - even in headers or in plain text.

Version 2 used 3 lexers.  lexer_head.l knows about header fields like 
Subject, Content-Type, Mime-Type, etc.  lexer_text_plain.l is for plain 
text message bodies and knew about ip addresses, tokens, 
etc.  lexer_text_html.l is for processing html and knows about html tags 
and comments and what to do with them.  Unfortunately, all three lexers 
need to know about message separators since splitting the input stream into 
messages is needed for registering mailboxes.

Version 2 has another problem - speed (or the lack there of) for messages 
with OLT's (overly long tokens).  Flex has the ability to build a "batch" 
lexer which promises significantly speed.  However a batch lexer does some 
look ahead which conflicts with bogofilter's 3 lexer design.  The conflict 
causes loss of the first line of a body section.

Version 3 of the lexer exists as some ideas of how to solve the current 
problems of controlling overall parsing flow and of code duplication.  The 
basic idea is to have a master parser, perhaps known as the "driver", that 
would handle message headers and mime boundaries and would invoke the plain 
text or html text parser at the right time.  The difficult part is how to 
pass text (with mime decoding, if needed) from the driver to the 
others.  So far nobody knows quite how to do this, so version 3 is an idea, 
not an implementation.

Reviewing why bogofilter has 3 lexers we realize it was a solution to a 
problem - the problem of how to implement distinct lexer states for 
headers, plain text, and html.  As I've been reading the flex 
documentation, I've been learning and thinking about "start conditions" 
i.e. the flex's mechanism for conditionally activating rules.  Flex has 
special macros BEGIN and YY_STATE for setting and testing the state and 
allows patterns to be prefixed with a start condition.  The pattern is 
active only when one of its start conditions is the current active 
state.  Looking at Nick's code for processing html tokens and comments I 
see that the code is already using start conditions.

Using 3 major start conditions, perhaps called HEAD, TEXT, and HTML (to 
denote message headers, plain text, and html text), a single lexer could 
have all the patterns that bogofilter needs.  With proper use of the start 
conditions, rules would only be active when appropriate.  (For those who 
don't know, a start condition is included at the beginning of a rule and is 
enclosed in angle brackets.  For example <HEAD> would precede a rule that 
only applies in the message header).  As a further example, the rule for an 
html comment would be preceded by <HTML>.

A combination lexer with start conditions solves several other problems as 
well.  As a single lexer needs only one input source, for example stdin, 
there's no problem passing text between lexers.  This also means there is 
no need for the header lexer to pass control to a body lexer so batch mode 
becomes usable and bogofilter will no longer lose a line of text.

Unless, I am mistaken, we should change the plan for version 3 from a 3 
lexer, master/slave design to a 1 lexer start condition design.

Now a couple of developer questions:

Nick:  You've been working on the lexer and investigating batch mode and 
the problems attendant to it.  What do you think?

Matthias:  You had the insight that bogofilter's lexer needed to be 
partitioned because of the problems with the single lexer.  Having 
partitioned the lexing process as 3 distinct lexers we have learned some of 
the pitfalls of that approach.  What do you think of the idea of using 
start conditions to provide the needed functional partition?

Me?  I think it's an approach that has promise.  At the very least it's 
worth implementing a prototype to see if the approach helps and to learn 
about its drawbacks.

... and that's what I've been reading and thinking about today ...

G'night.

David