Lexer restructure.

Matthias Andree matthias.andree at gmx.de
Tue Jan 28 03:12:26 CET 2003


Hi,

after thinking about the lexer restructuring: it is a big change and
thus 0.11 stuff, so I won't do it before 0.10.x goes stable. Otherwise
it would be one of those last-minute changes done in a hurry that cause
more loose ends than we can tie up in due time.

My current ideas, loosely gathered, are:

- make lexer_head into a full MIME structure parser, name it
  lexer_structure. The problem is that this function is low-level and
  high-level at the same time. It might be easier if it could stream
  out its input as the driver, which makes me wonder if this part
  should usurp most of the main() and collect() code and drive
  everything else itself.

- write a function that calls this structure lexer, decodes and buffers
  data up to a maximum line size.

- yyinput for those lexer_text_* will read from that buffer and emit
  tokens.
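
A minimal sketch of the last two items, under loud assumptions: source_t
stands in for whatever decoded data the structure lexer would deliver,
MAXLINE is an arbitrary cap kept tiny for demonstration, and fill_line
and text_input are hypothetical names, not existing bogofilter
functions. It only shows that memory stays bounded, because an overlong
line is handed on in MAXLINE-sized pieces instead of being buffered
whole.

```c
#include <assert.h>
#include <stddef.h>

#define MAXLINE 16  /* hypothetical maximum line size, tiny for the demo */

/* Hypothetical stand-in for the decoded output of the structure lexer. */
typedef struct {
    const char *data;  /* decoded bytes, may contain NUL */
    size_t len;        /* number of valid bytes in data */
    size_t pos;        /* read position */
} source_t;

/* Copy the next line (up to and including '\n') into buf, but never more
 * than cap bytes; an overlong line is delivered in cap-sized pieces.
 * Returns the number of bytes stored, 0 at end of input. */
static size_t fill_line(source_t *src, char *buf, size_t cap)
{
    size_t n = 0;

    assert(cap > 0);
    while (n < cap && src->pos < src->len) {
        char c = src->data[src->pos++];
        buf[n++] = c;
        if (c == '\n')
            break;
    }
    return n;
}

/* Hypothetical yyinput-style reader for the lexer_text_* scanners: it
 * hands over at most max_size bytes, and never more than MAXLINE. */
static size_t text_input(source_t *src, char *out, size_t max_size)
{
    size_t cap = max_size < MAXLINE ? max_size : MAXLINE;

    return fill_line(src, out, cap);
}
```

The reader works with a byte count throughout, so NUL bytes in the
decoded data pass through untouched.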

However, the exact interface is still unclear, and the call graph
worries me a bit. It seems I need lexer_text_* to call back into the
lexer_structure stuff to fill the buffer (maybe yywrap can help here),
but I wonder how much this will poison lexer_structure, because it
would actually play two roles in one: slave to lexer_text_* to fill the
buffer, and driver for the lexer_text_*. Talk about cyclic calls and
ugliness.
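
To make the call-graph worry concrete, here is one possible shape, a
sketch and not a decision: the text lexer owns a small buffer and asks
for more data through a function pointer when it runs dry, so the
structure side is only ever called, never calling back. All names
(refill_fn, text_lexer_t, part_t, part_refill, count_tokens) are
hypothetical, and counting whitespace-separated words stands in for the
real lexer_text_* scanners.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical refill hook: stores up to cap bytes in buf and returns
 * the count, 0 when the current part is exhausted. */
typedef size_t (*refill_fn)(void *ctx, char *buf, size_t cap);

typedef struct {
    refill_fn refill;  /* supplied by the structure side */
    void *ctx;         /* opaque state of the structure side */
    char buf[8];       /* deliberately tiny to force several refills */
    size_t len, pos;
} text_lexer_t;

/* Return the next byte, or -1 once the refill hook reports the end. */
static int next_byte(text_lexer_t *lx)
{
    if (lx->pos == lx->len) {
        lx->len = lx->refill(lx->ctx, lx->buf, sizeof lx->buf);
        lx->pos = 0;
        if (lx->len == 0)
            return -1;
    }
    return (unsigned char)lx->buf[lx->pos++];
}

/* Stand-in for a text lexer: count whitespace-separated tokens. */
static size_t count_tokens(text_lexer_t *lx)
{
    size_t tokens = 0;
    int in_token = 0, c;

    while ((c = next_byte(lx)) != -1) {
        if (isspace(c)) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;
            tokens++;
        }
    }
    return tokens;
}

/* Hypothetical structure side: one already decoded part in memory. */
typedef struct {
    const char *data;
    size_t len, pos;
} part_t;

static size_t part_refill(void *ctx, char *buf, size_t cap)
{
    part_t *p = ctx;
    size_t n = 0;

    assert(cap > 0);
    while (n < cap && p->pos < p->len)
        buf[n++] = p->data[p->pos++];
    return n;
}
```

This only removes the cycle for a single part; who switches between the
lexer_text_* variants at part boundaries is still open.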

Ideas, brainstorming etc. are welcome; this task is in the "planning"
phase. Don't hold back your idea even if it sounds stupid to you;
someone else may derive the solution from it.


The easy way out would be to buffer whole MIME parts and use lexers on
them, but that's extremely memory intensive, and bounded memory use is
not just nice to have but a requirement for large-scale deployment. A
site can't afford to run 50 bogofilter processes at the same time if
each buffers whole parts: on big mails they will eat up the machine's
memory even if it has 512 MB.


As a side note, I noticed David "fixed" lexer stuff to NUL-terminate
strings. I wonder if that's safe, because we need to pass on NUL bytes
as far as possible. Rather than making the low-level functions, which
do have a count, compatible with C strings (which is bound to wreak
havoc when NUL bytes are in the input, because then strlen() doesn't
match the count), the functions that use C strings need to be fixed to
take a count and use fwrite and the like.
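
A small illustration of the mismatch, assuming a hypothetical counted
buffer type span_t (not a bogofilter type): a C-string function stops at
the first NUL, while fwrite with the stored count passes every byte on.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical counted buffer: the bytes plus their true length. */
typedef struct {
    const char *data;
    size_t len;
} span_t;

/* Write all len bytes, embedded NULs included.
 * Returns the number of bytes actually written. */
static size_t span_write(const span_t *s, FILE *fp)
{
    assert(s->data != NULL);
    return fwrite(s->data, 1, s->len, fp);
}

/* Length that a C-string function such as strlen() would see: the
 * offset of the first NUL, or len if the buffer contains none. */
static size_t cstring_view_len(const span_t *s)
{
    const char *nul = memchr(s->data, '\0', s->len);

    return nul ? (size_t)(nul - s->data) : s->len;
}
```

For the five bytes 'a', 'b', NUL, 'c', 'd', cstring_view_len() reports 2
while span_write() writes all 5, which is exactly the kind of
disagreement described above.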

Cheers,

-- 
Matthias Andree



