Lexer restructure.

Matthias Andree matthias.andree at gmx.de
Wed Jan 29 01:27:02 CET 2003


David Relson <relson at osagesoftware.com> writes:

> To keep this all from being boringly simple, there are the input routines.
> At the lowest level, bogofilter reads from stdin (or other file).  This
> reader needs (I think) to handle unfolding of header continuation
> lines.

It might be useful to do that in the reader, or the parser itself might
do it before passing things on.
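
Roughly what I mean, as a sketch only (the function name and buffering
are made up here, not bogofilter's current code): a continuation line
starts with SP or HTAB and gets appended to the previous line before
the result reaches the lexer.

#include <stdio.h>
#include <string.h>

/* Sketch: read one logical (unfolded) header line into buf; returns
 * its length, or -1 at end of input.  A line starting with SP or HTAB
 * continues the previous one and is appended to it. */
static int read_unfolded_line(FILE *in, char *buf, size_t size)
{
    size_t len;
    int c;

    if (fgets(buf, (int)size, in) == NULL)
        return -1;
    len = strlen(buf);

    while ((c = fgetc(in)) == ' ' || c == '\t') {
        /* drop the stored line terminator, keep one whitespace,
         * then append the folded continuation line */
        while (len > 0 && (buf[len-1] == '\n' || buf[len-1] == '\r'))
            buf[--len] = '\0';
        if (len + 2 >= size)
            break;
        buf[len++] = (char)c;
        buf[len] = '\0';
        if (fgets(buf + len, (int)(size - len), in) == NULL)
            break;
        len += strlen(buf + len);
    }
    if (c != EOF && c != ' ' && c != '\t')
        ungetc(c, in);
    return (int)len;
}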

> It is also the source of whole lines for the main lexer to pass to the
> decoders.  (Note: for plain text, the decoder is a no-op.)  A decoded
> line

This is the tricky part, because the line breaks in base64-encoded text
don't match the line breaks in the decoded (original) text.
Quoted-printable with soft line breaks (i.e. =$ as a regexp) also falls
into this category, but it can be worked around much more simply.
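
To show why the quoted-printable case is the simpler one: a soft break
is just a trailing '=' before the line terminator, so the reader can
strip it and splice the next encoded line on.  A rough sketch
(hypothetical name, not existing code):

#include <stddef.h>

/* Returns nonzero if the encoded line ends in a soft break.  In that
 * case the trailing '=' and the line terminator are removed, so the
 * caller can append the next encoded line directly and the decoded
 * text keeps its original, unbroken form. */
static int qp_soft_break(char *line, size_t *lenp)
{
    size_t end = *lenp;

    while (end > 0 && (line[end-1] == '\n' || line[end-1] == '\r'))
        end--;
    if (end == 0 || line[end-1] != '=')
        return 0;           /* hard break: leave the line untouched */
    line[--end] = '\0';     /* soft break: drop '=' and the CRLF */
    *lenp = end;
    return 1;
}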

I've thought along Matt's lines before, not only for bogofilter but
also for leafnode: implement a buffered read, save the unused part for
later, and prepend it to the input when the time comes.
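
A minimal sketch of that idea (all names hypothetical, nothing of this
exists yet): keep a pushback buffer per stream and consult it before
reading more input.

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Buffered reader with pushback: leftover bytes from a previous read
 * are saved and handed back before any new data is read. */
typedef struct {
    int    fd;
    char   saved[4096];   /* unused tail of an earlier read */
    size_t saved_len;
} pbread_t;

/* Return saved bytes first, then fall back to a real read(2). */
static ssize_t pb_read(pbread_t *p, char *buf, size_t size)
{
    if (p->saved_len > 0) {
        size_t n = p->saved_len < size ? p->saved_len : size;
        memcpy(buf, p->saved, n);
        memmove(p->saved, p->saved + n, p->saved_len - n);
        p->saved_len -= n;
        return (ssize_t)n;
    }
    return read(p->fd, buf, size);
}

/* Save the part that was not consumed; it is prepended to whatever is
 * already saved so it comes back first on the next pb_read(). */
static void pb_unread(pbread_t *p, const char *buf, size_t len)
{
    if (len <= sizeof(p->saved) - p->saved_len) {
        memmove(p->saved + len, p->saved, p->saved_len);
        memcpy(p->saved, buf, len);
        p->saved_len += len;
    }
}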

> is passed to an appropriate body decoder (plain text, html, other(?)).

> Provisions are needed for a reader to get a previously decoded line.

This is the part I don't see how to write efficiently in flex.

> Unclear to me, too.  I'm thinking each mime level has a struct akin to
> our current msg_state.  The struct contains buffer info (address, size
> (maximum), and byte count (current usage)), lexer info (entry point,
> yytext address, yyleng address, etc), read info (function address, etc),
> plus ...

The buffer info (particularly with position and length information)
might turn out to be useful. That way, the whole reader might move into
a separate function that can be called by the main lexer as well as by
the text lexers. However, we need to be able to switch state in the
middle of a buffer that has already been read, so we must not have
decoded it prematurely.
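
Something along these lines is what I picture per MIME level; none of
the field names below are taken from the current msg_state, they only
illustrate how the buffer, lexer and reader information could hang
together.

#include <stddef.h>
#include <sys/types.h>

/* Sketch of a per-MIME-level state record (names are illustrative). */
typedef struct lexer_state {
    /* buffer info */
    char   *buf;          /* buffer address */
    size_t  size;         /* maximum size */
    size_t  used;         /* current byte count */
    size_t  pos;          /* read position, so state can be switched
                             mid-buffer without premature decoding */

    /* lexer info */
    int    (*lex)(void);  /* entry point of this level's lexer */
    char  **yytext_p;     /* address of that lexer's yytext */
    int    *yyleng_p;     /* address of that lexer's yyleng */

    /* read/decode info */
    ssize_t (*reader)(struct lexer_state *st, char *out, size_t len);
    ssize_t (*decoder)(struct lexer_state *st, char *out, size_t len);

    struct lexer_state *parent;   /* enclosing MIME level */
} lexer_state;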

-- 
Matthias Andree



