Lexer restructure.

Tue Jan 28 08:20:46 CET 2003

David Relson <relson at osagesoftware.com> writes:

> On the way into Dearborn (50 km) this morning, I was thinking about
> your "lexer restructure" message.  About the time I started thinking
> "multiple levels of mime encoding", I had the very same 0.11 thought
> you've had.

I've implemented a MIME aware mail processing library in Ruby where
each layer of the parsing process is represented by a different parser
class.  They are just:

    mailbox parser
    message parser

The mailbox parser is a stupid parser that implements:

    read    - return the next chunk of data from the stream, or EOF if
              at end of the message
    next    - tell the object to advance to the next message
    eof?    - true if true EOF has been reached

You'd use it like:

    until mbox_parser.eof?
      parse_message(mbox_parser)
      mbox_parser.next
    end
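
A minimal sketch of that three-method interface, not the actual
library code.  The class name MboxParser, the helper split_mbox, and
the line-at-a-time reading are assumptions made to keep the example
short; it treats any line starting with "From " as a message
separator and ignores ">From " quoting.

```ruby
require 'stringio'

# Hypothetical sketch of the mailbox parser interface (read/next/eof?).
class MboxParser
  def initialize(io)
    @io = io
    @line = @io.gets                       # one line of lookahead
    @line = @io.gets if separator?(@line)  # skip the first "From " line
  end

  # Return the next line of the current message, or nil at its end.
  def read
    return nil if @line.nil? || separator?(@line)
    line = @line
    @line = @io.gets
    line
  end

  # Advance to the next message, discarding any unread remainder.
  def next
    nil while read
    @line = @io.gets if separator?(@line)
  end

  # True only once the underlying stream is exhausted.
  def eof?
    @line.nil?
  end

  private

  def separator?(line)
    !line.nil? && line.start_with?("From ")
  end
end

# Collect each message as one string, using the loop shown above.
def split_mbox(io)
  parser = MboxParser.new(io)
  messages = []
  until parser.eof?
    text = +""
    while (line = parser.read)
      text << line
    end
    messages << text
    parser.next
  end
  messages
end
```

The point of the split between read and eof? is that a consumer can
run off the end of one message without ever seeing the next one.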

The message parser just takes an input stream and parses it like a
message.  It parses the header, then depending on the MIME fields in
the header parses the body as a single part or a multipart.  The
multipart parser actually creates NEW message parser objects for each
sub-part and in this way parses them recursively.
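
The recursion can be sketched roughly as follows.  This is an
illustration under simplifying assumptions, not the library itself:
the class name MessageParser and its accessors are made up here, the
parts are buffered in memory rather than streamed, and the newline
before a boundary is left attached to the part body.

```ruby
require 'stringio'

# Hypothetical sketch: a message parser that recurses into multiparts.
class MessageParser
  attr_reader :headers, :body, :parts

  def initialize(io)
    @io = io
    @headers = {}
    @body = nil
    @parts = []
  end

  def parse
    parse_header
    boundary = multipart_boundary
    if boundary
      parse_multipart(boundary)
    else
      @body = @io.read
    end
    self
  end

  private

  # Read header lines up to the blank line, folding continuation lines.
  def parse_header
    name = nil
    while (line = @io.gets)
      line = line.chomp
      break if line.empty?
      if line =~ /\A[ \t]/ && name
        @headers[name] << " " << line.strip
      elsif line =~ /\A([^:\s]+):\s*(.*)\z/
        name = $1.downcase
        @headers[name] = $2.dup
      end
    end
  end

  # Return the boundary string for multipart types, else nil.
  def multipart_boundary
    type = @headers["content-type"].to_s
    return nil unless type =~ /\Amultipart\//i
    type[/boundary="?([^";]+)"?/i, 1]
  end

  # Split the body on the boundary; each part gets its own NEW parser.
  def parse_multipart(boundary)
    delimiter = "--#{boundary}"
    current = nil                       # nil while still in the preamble
    @io.each_line do |line|
      marker = line.chomp
      if marker == delimiter || marker == "#{delimiter}--"
        @parts << MessageParser.new(StringIO.new(current)).parse if current
        break if marker == "#{delimiter}--"
        current = +""
      elsif current
        current << line
      end
    end
  end
end
```

A part whose own header says multipart is handled by the same code
path, so nesting depth needs no special treatment.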

For bogofilter, I'd extend this to support in-line decoding of
single-part bodies.  So depending on the content-transfer-encoding,
the message parser would create a "qp-decoder" parser or a
"b64-decoder" parser and pass its own input source to the decoder.
Then it'd read data from the decoder, and tokenize it according to
rules for the content type.
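
To make the decoder idea concrete, here is a rough sketch in Ruby
(bogofilter itself is C, so this only shows the shape).  The class
names, decoder_for, and tokenize_body are all hypothetical, and the
\w+ tokenizer is a placeholder.  It decodes line by line, which
assumes base64 lines are a multiple of four characters long and
means a word split by a quoted-printable soft line break would come
out as two tokens.

```ruby
require 'stringio'

# Hypothetical decoders: each wraps an input source and exposes read,
# which returns the next decoded line or nil at end of input.
class IdentityDecoder
  def initialize(source)
    @source = source
  end

  def read
    @source.gets
  end
end

class QpDecoder < IdentityDecoder
  def read
    line = @source.gets
    line && line.unpack1("M")   # "M" is quoted-printable
  end
end

class B64Decoder < IdentityDecoder
  def read
    line = @source.gets
    line && line.unpack1("m")   # "m" is base64
  end
end

# Pick a decoder from the Content-Transfer-Encoding value.
def decoder_for(encoding, source)
  case encoding.to_s.strip.downcase
  when "quoted-printable" then QpDecoder.new(source)
  when "base64"           then B64Decoder.new(source)
  else                         IdentityDecoder.new(source)
  end
end

# Read decoded text and tokenize it (placeholder tokenizer).
def tokenize_body(encoding, source)
  decoder = decoder_for(encoding, source)
  tokens = []
  while (text = decoder.read)
    tokens.concat(text.scan(/\w+/))
  end
  tokens
end
```

The caller never needs to know which encoding was in effect; it just
reads from whatever decoder it was handed.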

It is important to realize that while parsing a nested multipart
within an mbox file, you might end up with several nested parsers:

    message parser - for one of the parts within the multipart
    message parser - for top level multipart
    mbox parser - to process "From " lines

> continuation lines.  It is also the source of whole lines for the
> main lexer to pass to the decoders.  (Note: for plain text, the
> decoder is a no-op.)  A decoded line is passed to an appropriate
> body decoder (plain text, html, other(?)).  Provisions are needed
> for a reader to get a previously decoded line.  The usual file
> protocol which uses buffers and FILE * variables will likely work
> here.  I'm thinking here that a decoded line is stored as a buffer
> which is read by the next reader.
>
> This is the extent of my ideas so far, so I'll quit brainstorming
> :-)

Sounds like you're thinking along similar lines.  I think the above
design is overkill for bogofilter, but as an example it can get some
ideas flowing.

My Ruby code goes to great pains to be efficient.  E.g. it is not line
oriented, but rather each parser reads data in 16k chunks and "pushes
back" the part of the chunk it doesn't need for the next read.  This
is probably a bigger win under Ruby than it would be in C.
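
The chunk-and-push-back idea, sketched under assumptions: the names
PushbackReader, read_chunk, push_back, and read_header are invented
for this example, and the real code may be organized differently.

```ruby
require 'stringio'

# Hypothetical reader that hands out fixed-size chunks and accepts
# unused data back, so the next reader sees it first.
class PushbackReader
  CHUNK_SIZE = 16 * 1024

  def initialize(io, chunk_size = CHUNK_SIZE)
    @io = io
    @chunk_size = chunk_size
    @pushed = +""
  end

  # Return pushed-back data if any, else the next chunk; nil at EOF.
  def read_chunk
    unless @pushed.empty?
      chunk = @pushed
      @pushed = +""
      return chunk
    end
    @io.read(@chunk_size)
  end

  # Hand back the part of a chunk the caller did not need.
  def push_back(data)
    @pushed = data + @pushed
  end
end

# Example consumer: read up to the blank line that ends the header,
# then push back whatever belongs to the body.
def read_header(reader)
  buffer = +""
  while (chunk = reader.read_chunk)
    buffer << chunk
    idx = buffer.index("\n\n")
    next unless idx
    reader.push_back(buffer[(idx + 2)..-1])
    return buffer[0, idx + 1]
  end
  buffer
end
```

Whoever reads the body next starts with the pushed-back bytes, so no
data is lost and no layer has to read byte by byte.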