Lexer restructure.

Tue Jan 28 05:20:21 CET 2003

Matthias,

On the way into Dearborn (50 km) this morning, I was thinking about your 
"lexer restructure" message.  About the time I started thinking "multiple 
levels of MIME encoding", I had the very same 0.11 thought you've had.

In the following comments, I'll be referring to lexers, tokenizers, and 
readers.  The next generation of bogofilter's lexer, tokenizer, and readers 
may be similar to what we now have or they may be very different.  The 
lexer may be a flex grammar like we now have or it may be something 
else.  Time will tell.

In my thoughts the "main lexer" would know such things as header constructs
and MIME boundaries and how to shift between them.  In body mode, it would
call decoders as needed.  Also, upon encountering a MIME boundary it would
know to terminate the current body and/or start processing a MIME part
header and/or start a new level and/or stop the current level.
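
As a rough sketch of that mode switching, something like the following
might do.  All names here (lex_mode, line_kind, classify_line, next_mode)
are hypothetical, not existing bogofilter code, and the boundary test is
simplified: it ignores trailing whitespace after the boundary and does not
track nesting.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names throughout; not bogofilter's actual API. */
enum lex_mode  { MODE_HEADER, MODE_BODY };
enum line_kind { LINE_TEXT, LINE_BLANK, LINE_PART_START, LINE_PART_END };

/* Classify one line (newline already stripped) against the boundary
 * of the current MIME level.  boundary may be NULL (not multipart). */
static enum line_kind classify_line(const char *line, const char *boundary)
{
    size_t blen;

    assert(line != NULL);
    if (line[0] == '\0')
        return LINE_BLANK;
    if (boundary == NULL || line[0] != '-' || line[1] != '-')
        return LINE_TEXT;
    blen = strlen(boundary);
    if (strncmp(line + 2, boundary, blen) != 0)
        return LINE_TEXT;
    if (strcmp(line + 2 + blen, "--") == 0)
        return LINE_PART_END;       /* "--boundary--" closes the level */
    if (line[2 + blen] == '\0')
        return LINE_PART_START;     /* "--boundary" opens a new part */
    return LINE_TEXT;               /* merely shares a prefix */
}

/* Decide which mode the main lexer should be in after seeing a line. */
static enum lex_mode next_mode(enum lex_mode mode, enum line_kind kind)
{
    switch (kind) {
    case LINE_PART_START:
        return MODE_HEADER;         /* a part header follows */
    case LINE_PART_END:
        return MODE_BODY;           /* back in the enclosing body */
    case LINE_BLANK:
        return mode == MODE_HEADER ? MODE_BODY : mode;
    default:
        return mode;
    }
}
```

The point of splitting classification from the transition is that the
caller can also use the line kind to push or pop a level.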

To keep this all from being boringly simple there are the input
routines.  At the lowest level, bogofilter reads from stdin (or another
file).  This reader needs (I think) to handle unfolding of header
continuation lines.  It is also the source of whole lines for the main
lexer to pass to the decoders.  (Note: for plain text, the decoder is a
no-op.)  A decoded line is passed to an appropriate body decoder (plain
text, HTML, other(?)).  Provisions are needed for a reader to get a
previously decoded line.  The usual file protocol, which uses buffers and
FILE * variables, will likely work here.  I'm thinking here that a decoded
line is stored in a buffer which is read by the next reader.
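
A minimal sketch of the lowest-level reader, assuming it is only called
while in header mode and that truncating overlong lines is acceptable.
The name read_unfolded and its return convention are made up for
illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical reader: store one logical header line in buf (capacity
 * size), joining continuation lines that begin with a space or tab.
 * Returns the stored length, or -1 at end of input.  CR characters are
 * dropped and overlong lines are silently truncated (a simplification). */
static long read_unfolded(FILE *fp, char *buf, size_t size)
{
    size_t len = 0;
    int c;

    assert(fp != NULL && buf != NULL);
    if (size == 0 || (c = getc(fp)) == EOF)
        return -1;
    for (;;) {
        /* copy the rest of the current physical line */
        while (c != EOF && c != '\n') {
            if (c != '\r' && len + 1 < size)
                buf[len++] = (char)c;
            c = getc(fp);
        }
        if (c == EOF)
            break;
        /* peek at the first byte of the next physical line */
        c = getc(fp);
        if (c != ' ' && c != '\t') {
            if (c != EOF)
                ungetc(c, fp);      /* not a continuation: put it back */
            break;
        }
        /* continuation: keep its leading whitespace and carry on */
    }
    buf[len] = '\0';
    return (long)len;
}
```

A blank line comes back with length 0, which is how the caller would
notice the end of a header block.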

This is the extent of my ideas so far, so I'll quit brainstorming :-)

At 09:12 PM 1/27/03, Matthias Andree wrote:

>Hi,
>
>after thinking about the lexer restructuring, it is a big change and
>thus 0.11 stuff, I won't do this before 0.10.x goes stable. It would be
>one of those last-minute changes done in a hurry and cause more loose
>ends than we can tie up in due time.

I agree whole-heartedly.

>My current ideas, loosely gathered, are:
>
>- make lexer_head into a full MIME structure parser, name it
>   lexer_structure. The problem is that this function is low-level and
>   high-level at the same time. It might be easier if either it could
>   stream out its input as driver -- makes me wonder if this part should
>   usurp most of the main() and collect() code and drive everything else
>   itself.
>
>- write a function that calls this structure lexer, decodes and buffers
>   data up to a maximum line size.
>
>- yyinput for those lexer_text_* will read from that buffer and emit
>   tokens.
>
>However, the exact interface is still unclear, the call graph worries me
>a bit. It seems I need lexer_text_* to call back into the
>lexer_structure stuff to fill the buffer (maybe yywrap can help here),
>but I wonder how much this will poison lexer_structure because it's
>actually two in one: as slave to lexer_text_* to fill the buffer, and as
>driver for the lexer_text_*. Talk about cyclic calls and ugliness.

Unclear to me, too.  I'm thinking each MIME level has a struct akin to our
current msg_state.  The struct contains buffer info (address, maximum
size, and current byte count), lexer info (entry point, yytext
address, yyleng address, etc.), read info (function address, etc.), plus ...
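
To make that a bit more concrete, here is one way such a per-level struct
and a small stack of levels might look.  Every name and limit below
(mime_level, mime_stack, MAX_LINE, MAX_DEPTH, level_push, level_pop) is an
assumption for the sake of the sketch; it is not the layout of the
existing msg_state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE  1024  /* assumed cap on one decoded line */
#define MAX_DEPTH 8     /* assumed cap on MIME nesting */

struct mime_level;
typedef size_t (*read_fn)(struct mime_level *lvl, FILE *fp);
typedef int    (*lex_fn)(struct mime_level *lvl);

/* Hypothetical per-level state. */
struct mime_level {
    char        buf[MAX_LINE];  /* buffer; its maximum size is MAX_LINE */
    size_t      used;           /* bytes currently in buf */
    lex_fn      lexer;          /* lexer entry point for this level */
    const char *text;           /* plays the role of yytext */
    size_t      leng;           /* plays the role of yyleng */
    read_fn     reader;         /* fills buf from the level below */
    char        boundary[72];   /* RFC 2046 allows at most 70 chars */
};

struct mime_stack {
    struct mime_level level[MAX_DEPTH];
    int               depth;    /* number of levels in use */
};

/* Enter a new level; returns NULL if nesting is too deep. */
static struct mime_level *level_push(struct mime_stack *st, const char *boundary)
{
    struct mime_level *lvl;

    assert(st != NULL && boundary != NULL);
    if (st->depth >= MAX_DEPTH)
        return NULL;
    lvl = &st->level[st->depth++];
    lvl->used = 0;
    lvl->lexer = NULL;
    lvl->text = NULL;
    lvl->leng = 0;
    lvl->reader = NULL;
    strncpy(lvl->boundary, boundary, sizeof lvl->boundary - 1);
    lvl->boundary[sizeof lvl->boundary - 1] = '\0';
    return lvl;
}

/* Leave the current level; returns the enclosing one, or NULL if none. */
static struct mime_level *level_pop(struct mime_stack *st)
{
    assert(st != NULL);
    if (st->depth > 0)
        st->depth--;
    return st->depth > 0 ? &st->level[st->depth - 1] : NULL;
}
```

With fixed limits like these the total footprint is MAX_DEPTH * MAX_LINE
bytes plus bookkeeping, no matter how large the message is.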

>Ideas, brainstorming etc. are welcome, this task is in "planning" phase.
>Don't hold back your idea if it sounds stupid to you, someone else may
>derive the solution from it.
>
>
>The easy way out would be to buffer whole mime parts and use lexers on
>them, but that's extremely memory intensive, and bounded memory use
>would not only be nice to have, but a requirement for big-scale
>deployment. A site can't afford to run 50 bogofilter processes at the
>same time, it will eat up their memory on big mails even if they have
>512 MB.

I think we can avoid buffering a whole message.  The current approach 
avoids that buffering and, though the approach we're discussing is 
different (with better design and structure and, likely, multiple levels of 
lexer, tokenizer, and reader - all interacting), I bet it can be 
implemented with the current O(1) memory usage.
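
To illustrate why line-at-a-time decoding keeps memory bounded, here is a
sketch of a quoted-printable decoder that works in place on a single line
buffer, so it needs no storage beyond the line itself.  The names
qp_decode_line and hexval are invented for this example; malformed
escapes are simply passed through unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of a hex digit, or -1 if c is not one. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = toupper(c);
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* Hypothetical decoder: decode one quoted-printable line in place.
 * buf must hold len bytes plus a terminating NUL.  Returns the decoded
 * length.  A trailing '=' is a soft line break and is dropped. */
static size_t qp_decode_line(char *buf, size_t len)
{
    size_t in = 0, out = 0;
    int hi = 0, lo = 0;

    assert(buf != NULL);
    while (in < len) {
        if (buf[in] == '=' && in + 2 < len
            && (hi = hexval((unsigned char)buf[in + 1])) >= 0
            && (lo = hexval((unsigned char)buf[in + 2])) >= 0) {
            buf[out++] = (char)(hi * 16 + lo);  /* "=XX" escape */
            in += 3;
        } else if (buf[in] == '=' && in + 1 == len) {
            in++;                               /* soft line break */
        } else {
            buf[out++] = buf[in++];             /* literal byte */
        }
    }
    buf[out] = '\0';    /* out <= len, so this stays inside the buffer */
    return out;
}
```

Since decoded output is never longer than the encoded input, the same
buffer can then be handed to the next reader.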