lexer redesign [was: the recent ^From issues.]

Mon Jan 27 14:19:52 CET 2003

At 07:47 AM 1/27/03, Matthias Andree wrote:

>Matthias Andree <matthias.andree at gmx.de> writes:
>
> > My original idea was to have one lexer (lexer_head.l) to gather the
> > structure, and pass decoded stuff down to the "token extracting"
> > lexers. Given that "^From " lines will never be encoded, this is
> > clean.
> >
> > Any rules that are aware of the message or MIME structure in
> > lexer_text_{plain,html}.l are clearly misplaced under these assumptions.
>
>To refine these thoughts, and after looking into token.c, the LEXER
>state switching is wrong. We need to always run lexer_lex() first, and
>if and only if it is in a "body" mode, decode the lines it gathered and pass
>them down to the corresponding text_*_lex() functions. This will allow us to put
>the whole decoding, structure detection and so on into lexer_head.l and
>make lexer_text_*.l simple and robust.
>
>Any objections?

Matthias,

I don't have time right now to look into this.  Simple and robust sounds
good.  As you've mentioned, the current code is lacking on both counts.  The
various tests for "msg_header || msg_state->mime_header" are how bogofilter
currently determines whether it is in header or body mode.  If you have a
better design, that'd be great.
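
To make the proposal concrete, here is a minimal sketch in C of the dispatch
as I understand it: one structure-aware function sees every line first, and
only lines it classifies as body reach the token-extracting function.  Every
name in it (lex_state_t, head_lex_line, text_plain_count_tokens) is a
hypothetical stand-in, not bogofilter's actual code, and the rules are
simplified assumptions: any line starting with "From " opens a new message,
a blank line ends the header, and decoding is left out entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the structure lexer's state. */
typedef enum { MODE_HEADER, MODE_BODY } lex_mode_t;

typedef struct {
    lex_mode_t mode;      /* where the structure lexer thinks we are */
    size_t header_lines;  /* lines consumed as separator or header */
    size_t body_tokens;   /* tokens produced by the text lexer */
} lex_state_t;

/* Stand-in for a text_*_lex() function.  It counts whitespace-separated
 * words and knows nothing about message or MIME structure. */
static size_t text_plain_count_tokens(const char *line)
{
    size_t n = 0;
    bool in_word = false;

    for (; *line != '\0'; line++) {
        if (isspace((unsigned char)*line)) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;
            n++;
        }
    }
    return n;
}

/* Stand-in for the structure lexer.  It is called for every line and is
 * the only place that decides between header and body. */
static void head_lex_line(lex_state_t *st, const char *line)
{
    assert(st != NULL && line != NULL);

    /* Simplified assumption: "From " at line start begins a new message. */
    if (strncmp(line, "From ", 5) == 0) {
        st->mode = MODE_HEADER;
        st->header_lines++;
        return;
    }

    if (st->mode == MODE_HEADER) {
        if (line[0] == '\0' || line[0] == '\n')
            st->mode = MODE_BODY;       /* blank line ends the header */
        else
            st->header_lines++;
        return;
    }

    /* Body mode: a real implementation would decode the line here
     * before handing it down; this sketch passes it through as-is. */
    st->body_tokens += text_plain_count_tokens(line);
}
```

With this shape the text function never needs a header-or-body test of its
own, which is the property the proposal is after.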

I do have a request.  Once you have the code written and the regression
tests pass, send out a patch that we can all test.  If it passes all tests,
then we can commit it to CVS.  If it's clearly superior, then it belongs in
0.10.1.2.  If not, I'll go ahead with my current 0.10.1.2 plans.

O.K.?

David