lexer history and thoughts

Matthias Andree matthias.andree at gmx.de
Sun Feb 23 15:04:19 CET 2003


David Relson <relson at osagesoftware.com> writes:

> Version 3 of the lexer exists as some ideas of how to solve the current
> problems of controlling overall parsing flow and of code duplication.
> The basic idea is to have a master parser, perhaps known as the
> "driver", that would handle message headers and mime boundaries and
> would invoke the plain text or html text parser at the right time.  The
> difficult part is how to pass text (with mime decoding, if needed) from
> the driver to the others.  So far nobody knows quite how to do this, so
> version 3 is an idea, not an implementation.

Well, the "brute force" approach will work, at the expense of memory
(aka. the Perl way): decode the full attachment or body part into a
buffer (provided it really is text/* or something else that we tokenize)
and then have the slave lexer read from that buffer. The other approach
is to give the master two functions: 1. decide which slave to call and
call it, and 2. serve data (or EOF at MIME part boundaries) to the slave.

However, even on fat 32-bit machines, the memory approach is asking for
trouble and for process concurrency limits of the "don't run more than
four bogofilters at the same time" kind: say the message size limit is
16 MByte; we might then end up with something like 50 MB of memory for
a single bogofilter process: 2 * 16 MByte (one for the -p mode copy,
one for the huge single text-mode part), some MByte for the DB cache,
plus the mapped libraries, program code and so on. We DO need to be
more conservative with memory if we want to keep bogofilter suitable
for company-size systems. It's not the code quality that worries me,
but the size.

> Reviewing why bogofilter has 3 lexers we realize it was a solution to a
> problem - the problem of how to implement distinct lexer states for
> headers, plain text, and html.  As I've been reading the flex
> documentation, I've been learning and thinking about "start conditions"
> i.e. flex's mechanism for conditionally activating rules.  Flex has
> special macros BEGIN and YY_START for setting and testing the state and
> allows patterns to be prefixed with a start condition.  The pattern is
> active only when one of its start conditions is the current active
> state.  Looking at Nick's code for processing html tokens and comments I
> see that the code is already using start conditions.

Hum, I wonder why they name their stuff "start condition"... It's really
a "sticky state" and thus the opposite of a START condition: a second
layer of states on top of the regular state automaton. That's why I
skipped the whole section the last time I read the manual. But at first
glance it looks possible to use it to solve this problem.
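
A toy sketch of the sticky behaviour with exclusive start conditions
(the state names and patterns are placeholders, not our real rules):

%option noyywrap
%x TEXT HTML

%%
    /* INITIAL doubles as the header state here; BEGIN(TEXT) or
       BEGIN(HTML) is sticky: afterwards only rules prefixed with that
       state (or <*>) can match, and YY_START tells us where we are */
^[!-9;-~]+:         { /* header field name */ }
^\n                 { BEGIN(TEXT); }  /* blank line: headers are over */
<TEXT>[[:alnum:]]+  { /* would go to the tokenizer */ }
<HTML>"<"[^>]*">"   { /* swallow an HTML tag */ }
<*>(.|\n)           { /* ignore everything else */ }
%%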

> Using 3 major start conditions, perhaps called HEAD, TEXT, and HTML (to
> denote message headers, plain text, and html text), a single lexer could
> have all the patterns that bogofilter needs.

This will mix all three lexers into one though, and I wonder what impact
that has on speed and size -- and maintainability. It will bring back
(or keep alive) the old problem of "what and when do we decode", with
the bloody ^From detection. I had hoped to fix this with the
split-lexers approach. The problem is, we can only decide AFTER reading
a line whether we need to decode it -- and if we only have one lexer,
how do we efficiently re-parse the decoded line? The problem has always
been:

1. we need to read the line and figure it's not a boundary
2. we may then need to feed it through qp or b64 decoders
3. we then need to parse the decoded line

There is the additional problem that input lines need not match logical
lines, because of qp soft line breaks (trailing = character) or base64
(which is totally unaware of line format); see the example below.
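
For illustration, quoted-printable can split one logical line -- even
in the middle of a token -- across two physical input lines with a
trailing soft break:

  this logical line contains a single long token like supercalifra=
  gilistic that only comes back together after decoding

The tokenizer must see the joined, decoded line, not the two raw ones.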

So start conditions would just restate the current approach in a
different formulation. I wonder how this would prevent losing the line
when the encoding is switched, for instance. I think we still need one
"driver" (that is aware of MIME structure and encodings) and one
"tokenizer" (that should know nothing of --boundary lines or ^From
lines, because that's the driver's task and because it only sees as
much data as the current MIME part comprises).

> A combination lexer with start conditions solves several other problems
> as well.  As a single lexer needs only one input source, for example
> stdin, there's no problem passing text between lexers.  This also means
> there is no need for the header lexer to pass control to a body lexer so
> batch mode becomes usable and bogofilter will no longer lose a line of
> text.

The line still gets lost unless you drop the decoding -- unless I am
mistaken and the lexer can in fact insert (splice) tokens _into_ its
own input stream. I wonder if REJECT or input() or yyless() can do that
effectively enough.
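
FWIW, unput() pushes characters back into the input stream, so an
(untested) way to splice a decoded line in front of the normal token
rules could be an action like this. The QP_LINE and DECODED states and
decode_qp_line() are made-up names, and note that unput() trashes
yytext, so the decoding has to happen first:

<QP_LINE>.+\n   {
                    size_t n;
                    char *dec = decode_qp_line(yytext, yyleng, &n);
                    BEGIN(DECODED);  /* so this rule doesn't re-match it */
                    while (n > 0)
                        unput(dec[--n]);  /* push back last char first */
                    free(dec);
                }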

OTOH, we might redefine yywrap() for slave lexers to call back into the
master lexer -- we'll have to figure out whether

ab<yywrap>cd will result in the slave reading abcd as a single token, though.

The "Multiple buffers" section in flex.info has some info on this issue.

Another idea that comes from brainstorming is instrumenting
YY_USER_ACTION to do the decoding. Not sure if that works.
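
For reference, YY_USER_ACTION runs after a rule has matched, right
before its action, and sees yytext/yyleng rather than raw input --
which is probably why decoding there feels awkward. A typical use is
plain bookkeeping, e.g.

#define YY_USER_ACTION  { lexed_bytes += yyleng; }  /* invented counter */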



On the overly long tokens issue, do "flex -p" (performance report) and
"flex -b" (report on backing-up rules) yield useful points where the
lexers can be optimized?

> Matthias:  You had the insight that bogofilter's lexer needed to be
> partitioned because of the problems with the single lexer.  Having
> partitioned the lexing process as 3 distinct lexers we have learned some
> of the pitfalls of that approach.  What do you think of the idea of
> using start conditions to provide the needed functional partition?

I don't think it's better than what we have now. We absolutely must
clean up the decoding vs. MIME boundary detection, else we're getting
nowhere. We'll just wind up fixing corner cases the same way as you're
doing now.

I have presented some alternative approaches; none of them is thought
out to the end yet, but I think the "multiple input buffers" and
"yywrap()" approaches might enable our master/slave design. I'm not
sure yet.

Another idea that came to mind recently is: fork the slave and pipe the
input into it. (Threads might have their value here, but I'm unsure we
want threads, for portability and robustness reasons: if one thread
crashes, the whole application falls over, and threads are not trivial
to debug.)
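
A rough sketch of the fork-and-pipe idea (error handling mostly
omitted; tokenize_stdin() is a stand-in for the slave's yylex() loop,
and getting the tokens back to the master would need a second pipe or
shared storage):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern void tokenize_stdin(void);   /* stand-in for the slave lexer */

static void feed_part_to_slave(const char *decoded, size_t len)
{
    int fd[2];
    pid_t pid;

    if (pipe(fd) < 0 || (pid = fork()) < 0)
        exit(EXIT_FAILURE);

    if (pid == 0) {                   /* child: becomes the slave lexer */
        close(fd[1]);
        dup2(fd[0], STDIN_FILENO);    /* decoded part arrives on stdin */
        close(fd[0]);
        tokenize_stdin();
        _exit(EXIT_SUCCESS);
    }

    close(fd[0]);                     /* parent: feed the decoded data */
    if (write(fd[1], decoded, len) < 0)
        perror("write");
    close(fd[1]);                     /* EOF marks the MIME part boundary */
    waitpid(pid, NULL, 0);
}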

BTW, the performance problem with overly long tokens is a regression
that must've crept in within the past week, maybe 10 days. My version
0.10.3.cvs.20030216 is quick (900,000 bytes input in 30 ms CPU time),
while 0.10.3.1.cvs.20030221 is a snail (same input in 3,220 ms CPU
time). Interestingly enough, the problem is smaller but still clearly
visible with a third of the input (300,000 bytes, 30 vs. 390 ms):
tripling the input multiplies the run time by roughly eight, close to
the factor of nine you'd expect from quadratic behaviour. So some
O(n^2) has been inflicted upon the code recently. Do you have ideas
what change may have caused this? If not, I'll have to do a binary
search.

BTW, I'm using this to test:

perl -e 'print "Header: xy\n\n", "ABC" x 300000, "\n"' \
| /path/to/bogolexer -v

-- 
Matthias Andree