The joy of buffer switching....

David Relson relson at osagesoftware.com
Tue Feb 25 14:46:29 CET 2003


At 12:23 AM 2/25/03, Nick Simicich wrote:
>At 11:02 AM 2003-02-24 -0500, David Relson wrote:
>
>>As I read the documentation on flex's support for multiple buffers, I see 
>>something different.  Flex works with the first buffer, then with the 
>>second, then returns to the first _at_the_same_place_ as before.  For 
>>bogofilter, we want to process the buffer starting at a _different_ place.
>
>No, we are resuming the buffer in the same place at which it was 
>suspended.  This is supposed to be something that works.  I see nothing in 
>the world that says, "it is evil to move buffers around," but conversely, 
>the buffer structures are exactly the same (I have checked) and I simply 
>do not see the difference between suspending the buffer by doing a 
>yy_switch_to_buffer and what they do....hmmmmm.
>
>>Given that we want to use one buffer (for batch efficiency), flex limits 
>>us to one parser.
>
>No.  We need to code without errors.  There is a fundamental error in the 
>way that the programs are coded, which I finally found...I think.

My goal is always coding without errors.  Unfortunately it's rarely 
achieved.  Much of the time I'm coding for a situation not fully 
understood, which makes the goal really difficult.  It'd be great if you 
have, indeed, finally found what's been bedeviling you.

>>Of course, the current bogofilter has 3 lexers and 3 buffers.  The big 
>>difference is that we are using flex in a line by line mode (interactive) 
>>rather than a batch mode.
>
>No, it is, as I suspected, an accident that it ever worked.

'Tis possible :-)

>>There are at least two reasons to convert back to a single lexer.  First, 
>>we want better head/body control in our parsing.  Using flex start 
>>conditions, i.e. states, that can be done.  Second, we want the improved 
>>efficiency of batch mode.
>>
>>I'm working on a single lexer _with_ states.  With luck I'll have it 
>>working in a day or two.  Then we can get back to the issue of performance.
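
For the record, here's a minimal sketch of what I mean by a single
lexer with states.  The condition and token names are hypothetical,
not bogofilter's actual rules; it just shows how exclusive start
conditions let one scanner treat header and body text differently:

    %x HEAD BODY
    %%
    <INITIAL,HEAD,BODY>^"From "  { BEGIN(HEAD);   /* new message */
                                   return FROM_SEPARATOR; }
    <HEAD>^\r?\n                 { BEGIN(BODY);   /* blank line
                                                     ends header */ }
    <HEAD>[^ \t\r\n]+            { return HEADER_TOKEN; }
    <BODY>[[:alpha:]]+           { return BODY_TOKEN; }
    <*>.|\n                      { /* skip everything else */ }
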
>
>Well, the problem is that we can't get the absolute best performance with 
>a combined parser because of "backing up".  The entire parser has to be 
>free of "backing up states" before you can eliminate the backing up code 
>and checks from the parser.  Please let me have one more whack at forcing 
>in EOFs so that the separate parsers have a chance.

We neither can nor want to stop your efforts.  If you can find a better way 
to skin the cat, we want it for bogofilter.
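
For anyone following along: "backing up" happens whenever the scanner
reads past its last accepting state and has to rewind before acting.
Running flex with -b writes a lex.backup file listing the offending
states; the manual's advice is to add rules until that report is
empty.  A classic (hypothetical) example:

    %%
    foo     { return FOO; }      /* token names are illustrative */
    foobar  { return FOOBAR; }
    /* On input "fooba" followed by anything but "r", the scanner
     * has read two characters past the accepting state for "foo"
     * and must back up.  Rules accepting the intermediate prefixes
     * "foob" and "fooba" would remove the backing up. */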

>Unless you are sure about how to eliminate backing up with the revised 
>html parser, I believe that the separate parsers will perform better....
>
>>>Doing the yy_switch_to_buffer() is supposed to take the variables that 
>>>are up in the air inside the lexer and stash them inside the buffer's 
>>>state variables.  But it just is not working.  I spent a while tracing
>>>this, watching it go wrong, until I realized that the
>>>yy_switch_to_buffer was hosing the buffer.
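
For reference, the usage pattern documented in the flex manual is
straightforward, which is what makes this so frustrating.  A minimal
sketch (the file and variable names are mine, not bogofilter's):

    YY_BUFFER_STATE main_buf, other_buf;

    main_buf  = yy_create_buffer(yyin, YY_BUF_SIZE);
    other_buf = yy_create_buffer(other_file, YY_BUF_SIZE);

    yy_switch_to_buffer(other_buf);  /* scan the other input a while */
    /* ... yylex() ... */
    yy_switch_to_buffer(main_buf);   /* supposed to resume main_buf
                                        exactly where it left off    */
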
>>
>>As a thought, if indeed the yy_switch_to_buffer() code (or other C code) 
>>is defective, there's a chance we could correct it and submit a patch to 
>>flex.  Given that thought, I poked around to find the origins of that 
>>code.  Running "strings /usr/bin/flex" suggests that the code is built
>>into the executable.  If we developed a patch to fix buffer switching,
>>the patch would have to be applied after flex converts the .l file to a .c
>>file.  This approach would be further complicated by supporting different 
>>versions of flex (and/or lex) on different versions of linux and of the 
>>other operating systems on which bogofilter runs.  I don't think this 
>>approach would be successful.
>
>The way that flex works is that there are two parts, essentially: the
>semi-fixed part is the generated code, and the variable part is the set
>of large tables in which each character is looked up, each lookup
>yielding either an action or another state to look up next.  This is
>why having a very large number of patterns is essentially a linear
>affair.  The more patterns there are, the more work goes into building
>the tables, but once that is done, a good implementation should be able
>to match each input character in pretty much constant time, i.e. the
>whole input in linear time.  I have a program which matches a complex
>regular expression against the mail log in real time, searching for
>mail sent to spamtraps, so that the IP addresses of the transmitting
>sites can be added to a blocklist.  (This is maintained as
>blocklist4.squawk.com, and regularly contains between 700 and 1000
>different sites that have tried to mail a spamtrap address at
>squawk.com within the last week.  A total of 9,121 addresses have been
>in this blocklist since I started keeping stats; that is the size of
>the table that is keyed by IP address.)

The table driven design and its advantages are, I think, well understood.
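
For readers who haven't looked inside a generated scanner, the inner
loop is roughly the following.  This is a toy sketch with made-up
names, not flex's actual generated code:

    /* One transition-table lookup per input character; remembers the
     * most recent accepting rule (a real scanner also remembers the
     * input position so it can back up to it). */
    static int scan(const int trans[][256], const int accept[],
                    const char *cp)
    {
        int state = 0, last_rule = -1;

        while (*cp) {
            int next = trans[state][(unsigned char)*cp];
            if (next < 0)
                break;                /* no transition: stop here */
            state = next;
            cp++;
            if (accept[state])
                last_rule = accept[state];  /* longest match so far */
        }
        return last_rule;   /* rule number to dispatch, or -1 */
    }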

>The error is in the semi-fixed part of flex.  Yep, if I (or anyone)
>could figure out what is wrong with the code, then it might be possible
>to use awk, or even patch, to put in the repaired code.
>
>The reason one uses flex is essentially to build the state machine
>offline, and to build it as efficiently as possible.  But theoretically
>it should be possible to do all of this dynamically: with a good enough
>implementation of regular expression compilation, as long as you can
>reduce your matching to a *single* regular expression, no matter how
>complex, and you have a good way of dispatching actions, there is no
>reason not to use regular expressions at run time.  The "boost" regular
>expression library is probably nearly as good as flex.
>
>(Hmmm...in demime, I match regular expressions sequentially.  I have
>groups of regular expressions that are applied, and I should probably
>combine them.  By putting each expression in non-capturing parentheses
>and "or"-ing them together, I should be able to apply them all at once.
>I should determine whether it is faster to do it that way.)
>
>>>I have two approaches.  One is that I should be calling 
>>>yy_switch_to_buffer from within the rule rather than from outside.  I 
>>>will try adding the code to the processing of "From".
>>>
>>>If that fails, then I will work on forcing in EOFs and moving detection 
>>>of From, mime boundaries and header ends to the code outside of the lexer.
>>>
>>>Feeding the lexer artificial EOFs at the end of every section is 
>>>probably clean enough to work unconditionally.
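
To make "artificial EOF" concrete in flex terms: the input macro
signals end-of-input by setting its result to 0, and an <<EOF>> rule
lets the caller take over before yyrestart() continues with the next
section.  A sketch (read_one_part_line() is hypothetical):

    %{
    /* Returning result = 0 at a part boundary fakes an EOF. */
    #define YY_INPUT(buf, result, max_size) \
            { result = read_one_part_line((buf), (max_size)); }
    %}
    %%
    <<EOF>>  { return END_OF_PART;  /* caller may switch lexers,
                                       then call yyrestart() to
                                       continue scanning */ }
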
>>
>>Artificial EOFs sound reasonable, and I suspect they would work.
>>
>>Returning to the master/slave design, another approach would be to have 
>>the master separate out each mime body part (with decoding) and pass that 
>>(as a buffer) to the slave.  It would use additional memory, but it 
>>should be much simpler than all the complicated buffer contortions you're 
>>encountering.
>
>I believe this is just another way of saying "artificial EOF".
>
>I guess my point is that it *should* be possible to do either.  What we 
>should do is this:
>
>You should work on combining all the buffers into one with states.
>
>I should work on artificial EOFs, and moving the part detection up and out.
>
>We should try them out against each other.  I reserve the right to be 
>wrong...but the trials should happen with the non-backing-up plain text parser.
>
>We should work up a performance suite.  It should be representative of the 
>actual things that people do with bogofilter.  I believe that mailbox 
>processing should be only a small part of the process - most of it should 
>be parsing single messages, even large single messages with lots of 
>sections, but they should be representative of the actual messages that 
>people are getting.  I would be willing to, say, start with a fixed 
>database, take, say, 10-15 representative messages, parse them 100 times 
>each, and use the total elapsed and user time as an indicator.
>
>Once this is done, different approaches, parameters, and so forth, can be 
>compared. The numbers I am throwing around may well be more representative 
>of a pathological case than actual processing.
>
>-=-=-=-=-=-=-=-=-
>After writing the stuff above, other than the allusions to "it was an
>accident it ever worked", I finally figured out the bug.
>
>This is the problem, as I see it:
>
>1.  yyinput was being called from within flex.  There was a bunch of
>state up in the air when it was called.  It was called because, for
>some reason, a buffer needed more input.  So we were handed a buffer
>(actually an offset into a buffer).
>
>2.  We started filling that buffer.  We actually started altering that
>buffer.  That buffer was passed from routine to routine and modified in
>all sorts of ways.

We're acting under orders.  The flex code called our routine, i.e. 
yyinput(), because it wanted more data and we're the supplier.  The buffer 
is passed around because we need to get the _right_ data.  Depending on 
circumstances, that'll be raw input, decoded text, or yyredo text (for 
example for tagging header tokens and for block_on_subnet).
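
In other words, flex pulls all of its input through one supplier
routine, whose job is to hand back whichever text is appropriate at
the moment.  A sketch of that job (every helper name here is
hypothetical; only the role matches lexer.c):

    int yyinput(unsigned char *buf, size_t max_size)
    {
        if (have_redo_text())          /* header tags, block_on_subnet */
            return copy_redo_text(buf, max_size);
        if (decoding_body())           /* QP/base64 decoded text */
            return read_decoded(buf, max_size);
        return read_raw_line(buf, max_size);  /* plain raw input */
    }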

>3.  At that point, outside of the lexer, we started looking at the
>input.  One example is that we looked to see if there was a "From " in
>the input.  That check was not first done in the lexer, as it seemed;
>that match is simply never made there, so it is a waste of time in the
>lexer.  (I think it was left there to confuse me.)

As described elsewhere, the check for "From " is _necessary_ and I have 
test cases to demonstrate the need for it.  FWIW, the match _is_ made.  Try 
setting a breakpoint at is_from() and you'll see that it gets called twice 
for each message separator - once from lgetsl() in lexer.c and once from 
the "^From " pattern of the current lexer.

>4.  Well, we have been handed buffer A from, say, the plain text
>lexer, and that is the one we are putting stuff into.  But --
>whoops...we have just discovered, *by putting the data into the
>buffer*, that we will be switching lexers.  With the data partially in
>the buffer, but with the count unreported to flex, we tell flex that we
>want to switch buffers...whoops!  That is why I saw the processing
>overlaying part of the data in the buffer.

Sorry, but flex isn't told anything.  The state is known only in token.c 
and isn't used until the _next_ call to get_token().  The incoming data 
goes into the buffer provided by flex and is returned to flex which parses 
a token and returns to get_token().  When get_token() is next called by 
bogofilter, the new lexer is used.

The flaw in this sequence is that the remainder of the message separator
isn't parsed until bogofilter starts using the lexer that was active when
the message separator was encountered.

I'll modify the state changing code so that the change happens at the end 
of the line, which should correct _that_ problem with lexer switching.  Of 
course, I question the value of doing that since the unified lexer doesn't 
have the problem.

>5.  Now we return.  Holy cow are things hosed - we made a call with one 
>buffer and got the return with a *different* buffer active.  We have some 
>stuff that is in our stack, so we might make more updates to that buffer 
>we have stolen.

You may have discovered something relevant, I'm not sure.  However, you 
misunderstand some of the intricacies of what's happening.

First, there's the issue of "^From " detection.  Since the lexer is
caseless, its rule also recognizes "^fRoM ".  The extra checks, i.e.
"msg_header && is_from(yytext)", in the lexers are there for two
reasons:  (1) to verify proper case, and (2) to verify that we're in a
message header.  We've seen QP encoded text where a line begins
"=46rom", which decodes to "From".  The special check keeps us from
interpreting this as a message separator.
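
To make the two checks concrete, the rule looks something like this
(a sketch; the real pattern and action in the .l files differ):

    ^"From "  { /* The lexer is caseless, so this also matches
                 * "fRoM ".  The C test verifies exact case AND that
                 * we're still in a header, so a body line decoded
                 * from QP "=46rom" is treated as an ordinary token,
                 * not a separator. */
                if (msg_header && is_from(yytext))
                    return FROM_SEPARATOR;  /* token name is mine */
              }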

Second, lexer.c uses is_from() on the raw input line.  This is needed so 
that bogofilter knows when to set its header state.  The text is put where 
the current flex code wants it, nowhere else.  Since all 3 lexers check for 
message separators, it doesn't matter which lexer buffer gets the line - 
the message separator will be recognized.
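
For concreteness, is_from() amounts to a prefix test on the raw line;
this is a sketch from memory, not the actual code in bogofilter:

    #include <string.h>

    /* True if the raw line begins an mbox message separator. */
    static int is_from(const char *line, size_t len)
    {
        return len >= 5 && memcmp(line, "From ", 5) == 0;
    }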

Third, token.c's states (LEXER_HEAD, LEXER_TEXT, LEXER_HTML) determine 
which lexer is called.  Changing the state "in the middle of a call" 
doesn't affect the current call.  It affects the next call.
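
That is, the dispatch happens once per call, along these lines (a
simplified sketch of token.c's shape; the real function names differ):

    token_t get_token(void)
    {
        /* lexer_state was set by actions during *previous* calls;
         * a rule action may change it mid-call, but that only
         * affects which lexer the *next* call runs. */
        switch (lexer_state) {
        case LEXER_HEAD: return head_lexer_lex();
        case LEXER_TEXT: return text_plain_lexer_lex();
        default:         return text_html_lexer_lex();
        }
    }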

By the way, there _is_ a flaw in the state changing code.  Assume that
the plain text lexer is running when a message separator is detected.
The state gets changed to LEXER_HEAD and the next call to get_token()
uses the head lexer for the next line.  The remainder of the message
separator is left with the plain text lexer until the plain text lexer
is next used.  When that happens, bogofilter will include the leftover
tokens in the wrong message, which leads to incorrect counts for those
tokens.

>This should never have worked.  One way or the other we are ripping 
>something out from under something else.

Sorry, there is no "ripping out from under" :-(

>I made the first crack at fixing this - I created a second buffer in
>yyinput and had the input built there.  Then I saved it and returned it
>on the next read, forcing an EOF, if there had been a lexer change.
>But that is the wrong thing to do.  There are still things in the air,
>and there are still things that will break.
>
>Perhaps the simplest thing would be for the lexer swap to be deferred 
>until after the lexer returns.  In fact, I will try that next.  That has 
>the best chance of anything to work --- just postpone the lexer swap until 
>the lexer returns.

With multiple lexers, the state change _should_ await the end of the
line.  The problem goes away with a single lexer, as there are no lexer
states to worry about.
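
The deferral itself is simple enough; a sketch with hypothetical
names, recording the request and applying it only between lines:

    static lexer_t current_lexer, pending_lexer;

    void request_lexer_change(lexer_t new_lexer)
    {
        pending_lexer = new_lexer;     /* never switch mid-line */
    }

    void at_end_of_line(void)          /* called once a full line
                                          has been consumed       */
    {
        current_lexer = pending_lexer; /* now the swap is safe */
    }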

Regarding performance, I have no objection to your continuing work on 
buffer switching.  Performance improvements are always welcome.  The 
current trio of lexers is, I now believe, _not_ the right division of
labor.  The fact that they duplicate some patterns is a bad 
thing.  Yesterday I fixed a problem with the BOUNDARY pattern, which (at 
some point) had been corrected in lexer_head.l, but not in 
lexer_text_plain.l or in lexer_text_html.l.  Keeping them all synchronized 
is something of a problem.
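
For what it's worth, one way to keep such patterns in sync is to hold
them as flex named definitions in a single fragment that the build
concatenates into each .l file's definitions section.  The pattern
below is illustrative, not bogofilter's real BOUNDARY:

    /* patterns.lex -- shared fragment, prepended at build time */
    BOUNDARY   ^--[^\r\n]+

    /* then, in each lexer's rules section: */
    {BOUNDARY}  { got_boundary(yytext); }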

Anyhow, to get back on topic, when both lexer implementations are
working we can run some performance tests.

David





