The joy of buffer switching....

Mon Feb 24 17:02:08 CET 2003

Nick,

Such fun...

At 10:25 AM 2/24/03, Nick Simicich wrote:
>I have spent several more hours trying to work on moving buffers from 
>flexer to flexer.  I am convinced that the approach will not work.  It is 
>supposed to, but it does not.
>
>The specific issue is this:
>
>(gdb) p *yy_current_buffer
>$29 = {yy_input_file = 0x40231ce0,
>   yy_ch_buf = 0x80e42f0 "\nFrom glamb at dynagen.on.ca  Fri Nov  1 06:20:26 2002\n56 at wall.org>\n",
>   yy_buf_pos = 0x80e42f1 "From glamb at dynagen.on.ca  Fri Nov  1 06:20:26 2002\n56 at wall.org>\n",
>   yy_buf_size = 16384, yy_n_chars = 1, yy_is_our_buffer = 1,
>   yy_is_interactive = 1, yy_at_bol = 1, yy_fill_buffer = 1,
>   yy_buffer_status = 1}
>(gdb) p yy_n_chars
>$30 = 2
>(gdb)
>We are in the call where the buffer is being extracted. We have moved this 
>buffer from head-to-text, and that worked. We are moving the buffer 
>text-to-head. This is the first time we are extracting a buffer from the 
>plain text flexer in yy_switch_to_buffer(new_buffer).  The code that is 
>about to be executed is going to completely screw up the buffer - it will 
>overlay the 'o' in "From" with a null.  It has parsed the "From " token out, 
>and we should be saving the rest of the line for the next state.
>
>I am through working on this for the day.  If someone else can come up 
>with a workable buffer swapping scheme, I will certainly listen, or if 
>someone can tell me what I am doing wrong - I am still essentially running 
>the patch I posted yesterday, except I turned off optimization so that I can 
>run gdb more easily and so that the trace commands work more predictably.
>
>THIS IS SUPPOSED TO WORK, as far as I can tell on the man page.  You are 
>*SUPPOSED* to be able to stash a partially processed buffer, then go off 
>and do something else with the lexer, then return to the buffer, with the 
>buffer holding the state for where you are in the input stream.  The 
>buffer swapping is the essence of processing include files.  You hide the 
>input buffer, in your own data structure, switch buffers, and handle the 
>input associated with the new buffer.

I think there's a mismatch between what multi-lexer bogofilter needs and 
what flex does.  For bogofilter, we want to use one lexer to process part 
of the buffer, then switch to a different lexer to process the next part of 
the buffer, then return to the first lexer to process the next 
part.  Stated differently, we want one buffer to allow different lexers to 
work on it.

As I read the documentation on flex's support for multiple buffers, I see 
something different.  Flex works with the first buffer, then with the 
second, then returns to the first _at_the_same_place_ as before.  For 
bogofilter, we want to process the buffer starting at a _different_ place.
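
To make the distinction concrete, here is a minimal sketch in plain C of
what "one buffer, several lexers" would need.  It is not flex code and not
bogofilter code; the names shared_input, scan_line and scan_word are made
up for illustration.  The point is only that the read position lives with
the input rather than inside any one scanner, so whichever scanner runs
next picks up where the previous one stopped.  A flex-generated scanner
keeps that position in its own buffer state instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared input: one buffer, one cursor, several scanners. */
struct shared_input {
    const char *data;
    size_t len;
    size_t pos;   /* next unread byte; owned by the input, not a scanner */
};

/* "Header" scanner: copies one line (without its '\n') into out.
 * Returns the copied length, or -1 at end of input. */
static int scan_line(struct shared_input *in, char *out, size_t outsz)
{
    assert(outsz > 0);
    if (in->pos >= in->len)
        return -1;
    size_t n = 0;
    while (in->pos < in->len && in->data[in->pos] != '\n') {
        if (n + 1 < outsz)          /* truncate silently if out is small */
            out[n++] = in->data[in->pos];
        in->pos++;
    }
    if (in->pos < in->len)
        in->pos++;                  /* consume the newline */
    out[n] = '\0';
    return (int)n;
}

/* "Body" scanner: skips whitespace, then copies one word into out.
 * Returns the copied length, or -1 at end of input. */
static int scan_word(struct shared_input *in, char *out, size_t outsz)
{
    assert(outsz > 0);
    while (in->pos < in->len && isspace((unsigned char)in->data[in->pos]))
        in->pos++;
    if (in->pos >= in->len)
        return -1;
    size_t n = 0;
    while (in->pos < in->len && !isspace((unsigned char)in->data[in->pos])) {
        if (n + 1 < outsz)
            out[n++] = in->data[in->pos];
        in->pos++;
    }
    out[n] = '\0';
    return (int)n;
}
```

Calling scan_line() twice on "Subject: hi\n\nbuy now\n" and then
scan_word() yields "Subject: hi", an empty line, and then "buy": the second
scanner starts exactly where the first one left off.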

Given that we want to use one buffer (for batch efficiency), flex limits us 
to one lexer.

Of course, the current bogofilter has 3 lexers and 3 buffers.  The big 
difference is that we are using flex in a line by line mode (interactive) 
rather than a batch mode.

There are at least two reasons to convert back to a single lexer.  First, 
we want better head/body control in our parsing.  Using flex start 
conditions, i.e. states, that can be done.  Second, we want the improved 
efficiency of batch mode.

I'm working on a single lexer _with_ states.  With luck I'll have it 
working in a day or two.  Then we can get back to the issue of performance.
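
As a rough illustration of the idea (hand-written C, not the flex rules
themselves), a single scanner with a head/body state might classify lines
like this.  The names scan_state, line_kind and scan_step are hypothetical,
and the sketch assumes that any line starting with "From " begins a new
message, which is a simplification.  In flex the same switch would be done
with start conditions rather than an explicit state variable.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scanner states, standing in for flex start conditions. */
enum scan_state { ST_HEAD, ST_BODY };

enum line_kind { LK_FROM, LK_HEADER, LK_SEPARATOR, LK_BODY };

/* Classify one line (passed without its trailing newline) and update
 * the state.  Assumption: a line starting with "From " always begins
 * a new message and puts the scanner back into the header state. */
static enum line_kind scan_step(enum scan_state *st, const char *line)
{
    assert(st != NULL && line != NULL);
    if (strncmp(line, "From ", 5) == 0) {
        *st = ST_HEAD;
        return LK_FROM;
    }
    if (*st == ST_HEAD) {
        if (line[0] == '\0') {      /* blank line ends the header */
            *st = ST_BODY;
            return LK_SEPARATOR;
        }
        return LK_HEADER;
    }
    return LK_BODY;                 /* everything else is body text */
}
```

The same text ("Subject: ...") is reported as a header line in ST_HEAD and
as body text in ST_BODY, which is the head/body control we are after.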

>Doing the yy_switch_to_buffer() is supposed to take the variables that are 
>up in the air inside the lexer and stash them inside the buffer's state 
>variables. But it just is not working.  I spent a while tracing this, 
>watching it go wrong, until I realized that the yy_switch_to_buffer was 
>hosing the buffer.

As a thought, if indeed the yy_switch_to_buffer() code (or other C code) is 
defective, there's a chance we could correct it and submit a patch to 
flex.  Given that thought, I poked around to find the origins of that 
code.  Running "strings /usr/bin/flex" it appears that the code is built 
into the executable.  If we developed a patch to fix buffer switching, the 
patch would have to be applied after flex converts the .l file to a .c 
file.  This approach would be further complicated by supporting different 
versions of flex (and/or lex) on different versions of Linux and of the 
other operating systems on which bogofilter runs.  I don't think this 
approach would be successful.

>I have two approaches.  One is that I should be calling 
>yy_switch_to_buffer from within the rule rather than from outside.  I will 
>try adding the code to the processing of "From".
>
>If that fails, then I will work on forcing in EOFs and moving detection of 
>From, mime boundaries and header ends to the code outside of the lexer.
>
>Feeding the lexer artificial EOFs at the end of every section is probably 
>clean enough to work unconditionally.

Artificial EOFs sound reasonable and I suspect they would work.

Returning to the master/slave design, another approach would be to have the 
master separate out each mime body part (with decoding) and pass that (as a 
buffer) to the slave.  It would use additional memory, but it should be 
much simpler than all the complicated buffer contortions you're encountering.
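
A minimal sketch of that hand-off, under several assumptions: split_parts,
part_fn and count_bytes are hypothetical names, decoding is left out, the
part is passed as a pointer and length instead of being copied, and a
delimiter is any line that begins with "--" followed by the boundary
string.  Text before the first delimiter and after the closing
"--boundary--" line is ignored, as is an unterminated final part.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical slave interface: receives one body part as a buffer. */
typedef void (*part_fn)(const char *part, size_t len, void *ctx);

/* Master: walk the text line by line and hand each span between two
 * delimiter lines to the slave.  Returns the number of parts passed. */
static int split_parts(const char *text, const char *boundary,
                       part_fn slave, void *ctx)
{
    size_t blen = strlen(boundary);
    const char *part = NULL;    /* start of current part, NULL if none */
    const char *line = text;
    int count = 0;

    assert(blen > 0);
    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        const char *next = eol ? eol + 1 : line + strlen(line);
        int is_delim = line[0] == '-' && line[1] == '-' &&
                       strncmp(line + 2, boundary, blen) == 0;
        if (is_delim) {
            if (part != NULL) {
                slave(part, (size_t)(line - part), ctx);
                count++;
            }
            /* "--boundary--" closes the last part; otherwise a new
             * part starts on the following line. */
            const char *tail = line + 2 + blen;
            part = (tail[0] == '-' && tail[1] == '-') ? NULL : next;
        }
        line = next;
    }
    return count;
}

/* Example slave: just totals the bytes it is handed. */
static void count_bytes(const char *part, size_t len, void *ctx)
{
    (void)part;
    *(size_t *)ctx += len;
}
```

A real slave would run the lexer over each part it receives; count_bytes
only stands in for it so the sketch is complete.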

David