relson at osagesoftware.com
Mon Jan 27 23:20:21 EST 2003

On the way into Dearborn (50 km) this morning, I was thinking about your
"lexer restructure" message. About the time I started thinking "multiple
levels of mime encoding", I had the very same 0.11 thought you've had.
In the following comments, I'll be referring to lexers, tokenizers, and
readers. The next generation of bogofilter's lexer, tokenizer, and readers
may be similar to what we now have or they may be very different. The
lexer may be a flex grammar like we now have or it may be something
else. Time will tell.

In my thoughts the "main lexer" would know such things as header constructs
and mime boundaries and how to shift between them. In body mode, it would
call decoders as needed. Also, upon encountering mime boundaries it would
know to terminate the current body and/or start processing a mime part
header and/or start a new level and/or stop the current level.
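
A minimal C sketch of the mode switching described above, for illustration only. All names here (lex_mode_t, line_kind_t, classify_line, next_mode) are hypothetical and not part of bogofilter. It assumes lines arrive one at a time and that the active boundary string has already been taken from a Content-Type header. Nested levels are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical modes for the "main lexer". */
typedef enum { MODE_HEADER, MODE_BODY } lex_mode_t;

/* Hypothetical classification of a single input line. */
typedef enum {
    LINE_HEADER, LINE_BLANK, LINE_BODY, LINE_BOUNDARY, LINE_END_BOUNDARY
} line_kind_t;

/* Classify one line. `boundary` is the active MIME boundary without the
 * leading "--", or NULL if there is none. Only a prefix match is done;
 * a real parser would also check what follows the boundary. */
line_kind_t classify_line(lex_mode_t mode, const char *line,
                          const char *boundary)
{
    assert(line != NULL);
    if (boundary != NULL && line[0] == '-' && line[1] == '-') {
        size_t blen = strlen(boundary);
        if (strncmp(line + 2, boundary, blen) == 0) {
            const char *rest = line + 2 + blen;
            if (rest[0] == '-' && rest[1] == '-')
                return LINE_END_BOUNDARY;   /* "--boundary--" */
            return LINE_BOUNDARY;           /* "--boundary"   */
        }
    }
    if (mode == MODE_HEADER) {
        if (line[0] == '\0' || line[0] == '\r' || line[0] == '\n')
            return LINE_BLANK;
        return LINE_HEADER;
    }
    return LINE_BODY;
}

/* Decide the mode for the next line. */
lex_mode_t next_mode(lex_mode_t mode, line_kind_t kind)
{
    switch (kind) {
    case LINE_BLANK:        return MODE_BODY;   /* blank line ends a header */
    case LINE_BOUNDARY:     return MODE_HEADER; /* a part header follows    */
    case LINE_END_BOUNDARY: return MODE_BODY;   /* level closed             */
    default:                return mode;
    }
}
```

A driver would call classify_line() on each line, hand LINE_BODY lines to the current decoder, and use next_mode() to decide how to treat the following line.
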
To keep this all from being boringly simple, there are the input
routines. At the lowest level, bogofilter reads from stdin (or other
file). This reader needs (I think) to handle unfolding of header
continuation lines. It is also the source of whole lines for the main
lexer to pass to the decoders. (Note: for plain text, the decoder is a
no-op.) A decoded line is passed to an appropriate body decoder (plain
text, html, other(?)). Provisions are needed for a reader to get a
previously decoded line. The usual file protocol which uses buffers and
FILE * variables will likely work here. I'm thinking here that a decoded
line is stored as a buffer which is read by the next reader.
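
A minimal sketch of the lowest-level reader with header unfolding, assuming a fixed-size line buffer. The function name read_unfolded is hypothetical. It only makes sense in header mode, since a body line that starts with whitespace must not be joined to the previous line. Input beyond the buffer capacity is silently dropped here, and every CR is discarded, which a real reader might not want.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Read one logical header line from fp into buf (capacity cap, including
 * the terminating NUL). A line break followed by SP or TAB is treated as
 * a fold: the break is removed and the whitespace is kept.
 * Returns the number of bytes stored (0 for a blank line), or -1 at EOF
 * when nothing at all could be read. */
long read_unfolded(FILE *fp, char *buf, size_t cap)
{
    size_t n = 0;
    int c, got = 0;

    assert(fp != NULL && buf != NULL && cap > 0);
    while ((c = getc(fp)) != EOF) {
        got = 1;
        if (c == '\r')
            continue;                   /* drop CR of a CRLF pair */
        if (c == '\n') {
            int next;
            if (n == 0)
                break;                  /* blank line: end of header */
            next = getc(fp);            /* one character of lookahead */
            if (next == ' ' || next == '\t') {
                ungetc(next, fp);       /* continuation: keep reading */
                continue;
            }
            if (next != EOF)
                ungetc(next, fp);
            break;
        }
        if (n + 1 < cap)
            buf[n++] = (char)c;         /* excess input is dropped */
    }
    buf[n] = '\0';
    return got ? (long)n : -1;
}
```

Because the function only ever holds one logical line, its memory use is bounded by cap regardless of message size.
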
This is the extent of my ideas so far, so I'll quit brainstorming :-)

At 09:12 PM 1/27/03, Matthias Andree wrote:
>after thinking about the lexer restructuring, it is a big change and
>thus 0.11 stuff, I won't do this before 0.10.x goes stable. It would be
>one of those last-minute changes done in a hurry and cause more loose
>ends than we can tie up in due time.
I agree whole-heartedly.
>My current ideas, loosely gathered, are:
>- make lexer_head into a full MIME structure parser, name it
> lexer_structure. The problem is that this function is low-level and
> high-level at the same time. It might be easier if either it could
> stream out its input as driver -- makes me wonder if this part should
> usurp most of the main() and collect() code and drive everything else
>- write a function that calls this structure lexer, decodes and buffers
> data up to a maximum line size.
>- yyinput for those lexer_text_* will read from that buffer and emit
>However, the exact interface is still unclear, the call graph worries me
>a bit. It seems I need lexer_text_* to call back into the
>lexer_structure stuff to fill the buffer (maybe yywrap can help here),
>but I wonder how much this will poison lexer_structure because it's
>actually two in one: as slave to lexer_text_* to fill the buffer, and as
>driver for the lexer_text_*. Talk about cyclic calls and ugliness.
Unclear to me, too. I'm thinking each mime level has a struct akin to our
current msg_state. The struct contains buffer info (address, size
(maximum), and byte count (current usage)), lexer info (entry point, yytext
address, yyleng address, etc), read info (function address, etc), plus ...
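
One possible shape for such a per-level struct, as a rough C sketch. The field and function names (mime_level_t, mime_stack_t, mime_push, mime_pop) and the depth limit are assumptions made for illustration; only the three groups of fields (buffer, lexer, read) come from the description above. The yyleng pointer is declared as int * in this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MIME_DEPTH 8            /* assumed limit, not from the thread */

typedef struct mime_level {
    /* buffer info */
    char  *buf;                     /* address                */
    size_t size;                    /* maximum size           */
    size_t count;                   /* current usage in bytes */
    /* lexer info */
    int  (*lex)(void);              /* entry point                  */
    char **text;                    /* address of lexer's yytext    */
    int   *leng;                    /* address of lexer's yyleng    */
    /* read info */
    size_t (*read)(struct mime_level *lvl, char *dst, size_t max);
} mime_level_t;

typedef struct {
    mime_level_t levels[MAX_MIME_DEPTH];
    int depth;                      /* number of levels in use */
} mime_stack_t;

/* Start a new MIME level that uses the caller's buffer.
 * Returns NULL if the nesting is deeper than MAX_MIME_DEPTH. */
mime_level_t *mime_push(mime_stack_t *s, char *buf, size_t size)
{
    mime_level_t *lvl;

    assert(s != NULL);
    if (s->depth >= MAX_MIME_DEPTH)
        return NULL;
    lvl = &s->levels[s->depth++];
    lvl->buf = buf;
    lvl->size = size;
    lvl->count = 0;
    lvl->lex = NULL;
    lvl->text = NULL;
    lvl->leng = NULL;
    lvl->read = NULL;
    return lvl;
}

/* Stop the current level. Returns the enclosing level, or NULL if the
 * outermost level was just closed (or the stack was already empty). */
mime_level_t *mime_pop(mime_stack_t *s)
{
    assert(s != NULL);
    if (s->depth > 0)
        s->depth--;
    return s->depth > 0 ? &s->levels[s->depth - 1] : NULL;
}
```

With a fixed array of levels and caller-supplied buffers, the total memory is bounded by the depth limit times the buffer size.
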
>Ideas, brainstorming etc. are welcome, this task is in "planning" phase.
>Don't hold back your idea if it sounds stupid to you, someone else may
>derive the solution from it.
>The easy way out would be to buffer whole mime parts and use lexers on
>them, but that's extremely memory intensive, and bounded memory use
>would not only be nice to have, but a requirement for big-scale
>deployment. A site can't afford to run 50 bogofilter processes at the
>same time, it will eat up their memory on big mails even if they have
I think we can avoid buffering a whole message. The current approach
avoids that buffering and, though the approach we're discussing is
different (with better design and structure and, likely, multiple levels of
lexer, tokenizer, and reader - all interacting), I bet it can be
implemented with the current O(1) memory usage.
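
To illustrate why line-at-a-time decoding keeps memory bounded, here is a small C sketch of a decoder interface with a no-op plain-text decoder and a quoted-printable decoder that both work in place on one line. The names decoder_fn, decode_plain, and decode_qp are hypothetical. The sketch assumes ASCII and that the caller has already stripped the line terminator.

```c
#include <assert.h>
#include <stddef.h>

/* A decoder rewrites one line in place and returns the new length.
 * Decoded output is never longer than the input, so no extra buffer
 * is needed. */
typedef size_t (*decoder_fn)(char *line, size_t len);

/* Plain text: nothing to do. */
size_t decode_plain(char *line, size_t len)
{
    assert(line != NULL);
    return len;
}

/* Value of a hex digit, or -1 if c is not one. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Quoted-printable: "=XX" becomes one byte; a trailing "=" is a soft
 * line break and is dropped. Malformed sequences are copied as-is. */
size_t decode_qp(char *line, size_t len)
{
    size_t in = 0, out = 0;

    assert(line != NULL);
    while (in < len) {
        if (line[in] == '=') {
            if (in + 2 < len) {
                int hi = hexval((unsigned char)line[in + 1]);
                int lo = hexval((unsigned char)line[in + 2]);
                if (hi >= 0 && lo >= 0) {
                    line[out++] = (char)(hi * 16 + lo);
                    in += 3;
                    continue;
                }
            }
            if (in + 1 == len) {    /* soft line break */
                in++;
                continue;
            }
        }
        line[out++] = line[in++];
    }
    return out;
}
```

A caller would pick a decoder_fn per MIME part from its Content-Transfer-Encoding and apply it to each line before passing the line to the body lexer.
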