lexer history and thoughts

David Relson relson at osagesoftware.com
Sun Feb 23 15:44:00 CET 2003


Greetings Matthias,

At 09:04 AM 2/23/03, Matthias Andree wrote:

>David Relson <relson at osagesoftware.com> writes:

... [snip]...

> > Reviewing why bogofilter has 3 lexers we realize it was a solution to a
> > problem - the problem of how to implement distinct lexer states for
> > headers, plain text, and html.  As I've been reading the flex
> > documentation, I've been learning and thinking about "start conditions"
> > i.e. flex's mechanism for conditionally activating rules.  Flex has
> > special macros BEGIN and YY_START for setting and testing the state and
> > allows patterns to be prefixed with a start condition.  The pattern is
> > active only when one of its start conditions is the current active
> > state.  Looking at Nick's code for processing html tokens and comments I
> > see that the code is already using start conditions.
>
>Hum, I wonder why they name their stuff "start condition"... It's rather
>a "sticky state" and thus the opposite of a START condition. It's a
>second layer of states above the other state automaton. That's why I
>skipped the whole section the last time I read the manual. But it looks
>possible to use it to solve this problem at first glance.

I wondered about the name as well.  Perhaps they're thinking of it as 
the first condition that a rule must satisfy, i.e. a condition at the start 
of the rule...
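
For concreteness, here's a minimal sketch of what the combined rules 
might look like (the state names, the msg_is_html flag, and the token 
names are all hypothetical, just for illustration):

%x HEAD TEXT HTML

%%
    /* the driver would call BEGIN(HEAD) before the first yylex() */
<HEAD>^[A-Za-z][A-Za-z0-9-]*:  { return HEADER_TOKEN; }
<HEAD>^\n                      { BEGIN(msg_is_html ? HTML : TEXT); }
<HTML>"<"[^>]*">"              ; /* discard markup */
<TEXT,HTML>[A-Za-z'$]+         { return TOKEN; }
<*>.|\n                        ; /* ignore everything else */
%%

Each pattern is active only in the listed states, so the one lexer 
behaves like three.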

> > Using 3 major start conditions, perhaps called HEAD, TEXT, and HTML (to
> > denote message headers, plain text, and html text), a single lexer could
> > have all the patterns that bogofilter needs.
>
>This will mix all three lexers in one though, and I wonder what impact
>that has on speed and size -- and maintainability. It will bring back
>(or keep alive) the old problem "what and when do we decode" with the
>bloody ^From detection. I had hoped to fix this with the split-lexers
>approach. The problem is, we can only AFTER reading a line decide if we
>need to decode it -- and if we only have one lexer, how do we
>efficiently re-parse the decoded line?

Remember, the decoding is done by _our_ input routine - before lex even 
sees the line.  If memory serves, there's a special check for "^From " in 
the input routine.  Perhaps not a perfect design, but I find it acceptable.
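
In rough outline, the input wrapper does something like this (a hedged 
sketch; the function names and return codes are invented, not 
bogofilter's actual ones):

#include <stdio.h>
#include <string.h>

/* decode_line() stands in for the real qp/base64 decoders */
static size_t decode_line(char *line, size_t len) { return len; }

static long get_decoded_line(char *line, int size, FILE *fp)
{
    if (fgets(line, size, fp) == NULL)
        return -1;                      /* end of input */

    /* the "From " separator is checked on the raw line,
     * before any decoding and before flex sees anything */
    if (strncmp(line, "From ", 5) == 0)
        return -2;                      /* next message begins */

    /* qp/base64 decoding happens here, so the lexer only
     * ever receives decoded text */
    return (long)decode_line(line, strlen(line));
}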

>The problem has always been:
>
>1. we need to read the line and figure it's not a boundary
>2. we may then need to feed it through qp or b64 decoders
>3. we then need to parse the decoded line
>
>With the additional problem of input lines not matching logical lines,
>with qp soft line breaks (trailing = character) or base64 (that is
>totally unaware of line format).
>
>So start conditions would reflect the current approach in different
>formulation. I wonder how this would prevent losing the line when the
>encoding is switched or something. I think we still need one "driver"
>(that is aware of MIME structure and encodings) and one "tokenizer"
>(that should know nothing of --boundary lines or ^From lines because
>that's the driver's task and because it only sees as much data as the
>prevalent MIME part comprises)

I haven't stepped through the code, but I presume the line is lost in the 
change between lexers.  Using lexer states, there's only one lexer, so 
there's no place to lose the line.

> > A combination lexer with start conditions solves several other problems
> > as well.  As a single lexer needs only one input source, for example
> > stdin, there's no problem passing text between lexers.  This also means
> > there is no need for the header lexer to pass control to a body lexer so
> > batch mode becomes usable and bogofilter will no longer lose a line of
> > text.
>
>The line still gets lost unless you kill the decoding, unless I am
>mistaken about the lexer inserting (splicing) tokens _into_ its own
>input stream. I wonder if REJECT or input() or yyless() can do that
>effectively enough.

I hadn't known about yyless() until yesterday.  I did a little 
experimenting to see if it could replace our yyredo() function, which I 
find a bit awkward.  In something like:

^MIME-Version:.*  { mime_version(yy_text());
                    yyless(strlen("MIME-Version:"));
                    return (TOKEN); }

bogofilter gets "mime-version:" as a token (note the colon).  If get_token() 
removes the trailing colon, all the regression tests pass.  Without the 
"return (TOKEN)", bogofilter never gets a "mime-version" token (which makes 
the regression tests unhappy).

The above rule could be optimized by using a numeric constant instead of 
strlen(), though hard-coding the length is fragile.  Alternatively, 
mime_version() could return a count to allow a rule like:

^MIME-Version:.* { yyless(mime_version(yy_text())); return (TOKEN); }
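
For that to work, mime_version() would return the number of characters 
to keep.  A sketch (the real function's signature may well differ; this 
is just an assumption for illustration):

#include <string.h>

/* Hypothetical variant: record the MIME version from the matched
 * header line, then return how many characters the rule should
 * keep, so the action can pass the count straight to yyless(). */
static int mime_version(const char *text)
{
    /* ... parse and store the version from text ... */
    return (int)strlen("MIME-Version:");   /* keep just the tag */
}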

I've not experimented with REJECT or input() and don't know how effective 
they are.


>OTOH, we might redefine yywrap() for slave lexers to call back into the
>master lexer -- we'll have to figure out whether
>
>ab<yywrap>cd will result in the slave reading abcd as a single token though.
>
>The "Multiple buffers" section in flex.info has some info on this issue.
>
>Another idea that comes from brainstorming is instrumenting
>YY_USER_ACTION to do the decoding. Not sure if that works.
>
>
>
>On the overly long tokens issue, does "flex -p" and "flex -b" yield
>useful points where the lexers can be optimized?

Haven't tried them ...

> > Matthias:  You had the insight that bogofilter's lexer needed to be
> > partitioned because of the problems with the single lexer.  Having
> > partitioned the lexing process as 3 distinct lexers we have learned some
> > of the pitfalls of that approach.  What do you think of the idea of
> > using start conditions to provide the needed functional partition?
>
>I don't think it's better than what we have now. We absolutely must
>clean up the decoding vs. MIME boundary detection, else we're getting
>nowhere. We'll just wind up fixing corner cases the same way as you're
>doing now.
>
>I have presented some alternative approaches, none of them fully thought
>through yet, and I think the "multiple input buffers" and "yywrap()"
>approaches might enable our master/slave design. I'm not sure yet.
>
>Another idea that came to mind recently is: fork the slave and pipe the
>input into it. (Threads might have their place here, but I'm unsure we
>want them, for portability and robustness: if one thread crashes, the
>whole application falls over, and threads are not trivial to debug.)
>
>BTW, the performance problem with overly long tokens is a regression
>that must've crept in within the past week, maybe 10 days. My version
>0.10.3.cvs.20030216 is quick (900,000 bytes input in 30 ms CPU time),
>while 0.10.3.1.cvs.20030221 is a snail (same input in 3,220 ms CPU
>time). Interestingly enough, the problem is smaller but still clearly
>visible with a third of the input (300,000 bytes, 30 vs. 390 ms). So
>some O(n^2) behavior has been inflicted upon the code recently. Do you
>have any idea which change may have caused this? If not, I'll have to
>do a binary search.

I don't think there have been changes in that area.  gprof indicated that 
the two leading time consumers are

  89.12%  374.17s  yy_get_previous_state
  10.82%   45.44s  yy_get_next_buffer
   0.06%    0.24s  everything else

This is deep inside the flex code - far from anything you or I have written.

>BTW, I'm using this to test:
>
>perl -e 'print "Header: xy\n\n", "ABC" x 300000, "\n"' \
>| /path/to/bogolexer -v

You might want to test with Greg's actual files.  I've already sent you a 
URL for them.

>--
>Matthias Andree
>