lexer state machine [was: uuencoded attachments produce woe]
David Relson
relson at osagesoftware.com
Sun Dec 8 16:06:57 CET 2002
At 09:13 AM 12/8/02, Matthias Andree wrote:
>I'm wondering if it would be hard to make lexer.l's parsing stateful; no
>need to split that stuff.
>
>My idea is: parse the MIME header (REQUIRED!), Content-Type (multipart
>and message are the interesting primary types here), and
>Content-Transfer-Encoding. If multipart, push the boundary line onto a
>stack, and the boundary parser will look at the stack to figure out
>whether a line is a valid boundary and how many levels to pop from the
>stack. To do it simply, we'd only need to be able to push one token back
>to the parser and block a particular rule; the question is how much
>performance and size impact that has (if we need REJECT rules, for
>example).
Matthias,
I think we're moving in that direction, i.e. stateful. The lexer already
uses states; two examples are the past_header and url: code. I expect to
see more of that.
A couple of times in the past decade I've used yacc to advantage; still,
as a tool I don't use often or know well, I've found it difficult to work
with.
Gyepi wants to use the eps library. He's used it before and praises it; I
need to go read about it. FYI, the URL is http://inter7.com/eps and eps
stands for "Email Processing System".
On the subject of boundary lines: every once in a while I dump my wordlists
with bogoutil and look at the output. In addition to normal tokens, I see
lots of Korean text, numbers, URLs, and boundary lines. They all use up
space, and I don't think they significantly help classification. URLs may
be useful, but I doubt the others are of value.
More information about the Bogofilter mailing list