What about html_reorder?

Wed May 12 18:40:46 CEST 2004

On Wed, 12 May 2004 18:15:59 +0200
Ihunda wrote:

> Hi all,
> 
>   I am working on getting the most out of the bogofilter lexer,
> speed-wise. The less malloc, the less parsing, the better :)
>   That's why this line confuses me:
> 
>   <HTML>{TOKEN_12}({HTMLTOKEN})+/{NOTWHITESPACE}    { html_reorder(); return TOKEN; }
> 
> So does the code behind html_reorder, which mallocs, swaps memory and
> calls yyunput. I do all the MIME parsing before sending each decoded
> MIME part to the lexer, setting the initial state myself. For example,
> for an HTML part, the lexer is called with initial state HTML and the
> buffer is the HTML part itself.
> 
> That sometimes gives me a nice bug not seen otherwise:
>   flex scanner push-back overflow
> 
> This means that the unput went too far and stepped outside the
> buffer. It didn't throw an error before (when the parser handled the
> whole email) because there was some data before the HTML part, but
> that doesn't mean the bug didn't exist; it just didn't crash :).
> 
> To solve two problems in a row (yeah, I am that kind of person), what
> about getting rid of this whole html_reorder thing?
> 
> Isn't that enough:
>   <HTML>{TOKEN_12}({HTMLTOKEN})+/{NOTWHITESPACE}    { return HTMLREORDERTOKEN; }
> 
> And in the C code (token.c), in the big switch:
> case HTMLREORDERTOKEN:
>        real length of token = position of '<'
> 
> Like it's done for HEADKEY.
> 
> So, what do you guys think?
> 
> Giorgio
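
The truncation proposed in the quoted message (token length = position
of '<') could be sketched roughly as follows. This is only an
illustration: the helper name token_len_before_tag is hypothetical and
is not taken from bogofilter's token.c, and the sketch assumes the
matched text is a word immediately followed by one or more HTML tags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of bogofilter.
 * Given the text matched by the lexer rule (a word followed by one or
 * more HTML tags) and its length, return the length of the leading
 * word, i.e. the offset of the first '<'.  If no '<' is present, the
 * whole text is treated as the token. */
static size_t token_len_before_tag(const char *text, size_t len)
{
    const char *lt = memchr(text, '<', len);

    return lt != NULL ? (size_t)(lt - text) : len;
}
```

For the matched text "free<b></b>" (11 characters) the helper returns
4, so only "free" would be kept as the token. Nothing is copied or
pushed back into the scanner's buffer, which is the point of the
proposal.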

Hello Giorgio,

Welcome!  It sounds like you've been busy.

As you've found, dealing with HTML is a challenge.  Initially HTML
reordering was done separately from the lexer, i.e. outside of
lexer_v3.l.  After a while, doing it as part of lexer_v3.l was found to
be the better route.

Of course, since your background and experience are different from those
of us who wrote the first two versions of the code, you may well be able
to write a third, even better reorder routine.  

As you work, be sure to run "make check" to see whether anything has
broken.  Although "make check" doesn't test everything, it does a good
job.  If all the tests pass with your new code, odds are that it's
right.

I look forward to seeing your results.

Regards,

David

P.S.  The new address for the list is bogofilter-dev at bogofilter.org