The joy of buffer switching....

Wed Feb 26 17:46:33 CET 2003

Good morning Nick!

At 11:06 AM 2/26/03, Nick Simicich wrote:
>At 08:46 AM 2003-02-25 -0500, David Relson wrote:
>
>>>3.  At that point, outside of the lexer, we started looking at the 
>>>input.  One example is that we looked to see if there was a "From " in 
>>>the input.  That was not first done in the lexer; as far as I could tell, 
>>>that match is simply never made there, so it is a waste of time.  (I think 
>>>it was left there to confuse me).
>>
>>As described elsewhere, the check for "From " is _necessary_ and I have 
>>test cases to demonstrate the need for it.  FWIW, the match _is_ 
>>made.  Try setting a breakpoint at is_from() and you'll see that it gets 
>>called twice for each message separator - once from lgetsl() in lexer.c 
>>and once from the "^From " pattern of the current lexer.
>
>Actually, I looked at that and I do not see it being called at all.  Maybe 
>that is part of the issue as to why I am in b,b state and I do not expect 
>them to be.

Sounds like you've got something broken...

>>>4.  Well, we have been handed buffer A from, say, the plain text lexer, 
>>>and that is the one we are putting stuff into. But -- whoops...we have 
>>>just discovered, *by putting the data into the buffer* that we will be 
>>>switching lexers.  With the data partially in the buffer, but with the 
>>>count unreported to Flex, we tell Flex that we want to switch 
>>>buffers...whoops!  That was what/why I saw the processing overlaying 
>>>part of the data in the buffer.
>>
>>Sorry, but flex isn't told anything.  The state is known only in token.c 
>>and isn't used until the _next_ call to get_token().  The incoming data 
>>goes into the buffer provided by flex and is returned to flex which 
>>parses a token and returns to get_token().  When get_token() is next 
>>called by bogofilter, the new lexer is used.
>
>I am really tired, I did something else last night.  Part of the issue has 
>to do with the way I was swapping buffers and when.  I understand the 
>deferred state better now.

I understand.  I sometimes get really fatigued when something's not working 
and I don't understand why and the evidence seems contradictory...

>>The flaw in this sequence is that the remainder of the message separator 
>>isn't parsed until bogofilter again uses the lexer that was active when 
>>the message separator was detected.
>>
>>I'll modify the state changing code so that the change happens at the end 
>>of the line, which should correct _that_ problem with lexer 
>>switching.  Of course, I question the value of doing that since the 
>>unified lexer doesn't have the problem.
>>
>>>5.  Now we return.  Holy cow are things hosed - we made a call with one 
>>>buffer and got the return with a *different* buffer active.  We have 
>>>some stuff that is in our stack, so we might make more updates to that 
>>>buffer we have stolen.
>>
>>You may have discovered something relevant, I'm not sure.  However, you 
>>misunderstand some of the intricacies of what's happening.
>>
>>First, there's the issue of "^From " detection.  Since the lexer is 
>>caseless, its rule also recognizes "^fRoM ".  The extra checks, i.e. 
>>"msg_header && is_from(yytext)", in the lexers are there for two 
>>reasons:  (1) to verify proper case, and (2) that we're in a message 
>>header.  We've seen QP encoded text where a line begins "=46rom" which 
>>decodes to "From".  The special check keeps us from interpreting this as 
>>a message separator.
>
>Again, I do not see the From in, say, the Plain Text parser, as being 
>relevant.  The From is always detected from the input routine.  If it is 
>detected again in the parser, I saw it as a problem.

Ideally, neither the plain text nor the html parser should know about 
From.  However when they're in control, there needs to be a way to say 
"return to header mode", which is a task that the low-level check for From 
helps with.  However, by the time the low-level code sees the From, it's 
too late to say "I should be using the header lexer on this line".  Of 
course with a single lexer, this problem goes away, because no buffer/lexer 
switching is needed.
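
To make the two checks concrete, here is a minimal C sketch.  The helper 
names is_from_exact() and is_separator() are made up for illustration and 
are not bogofilter's actual code; only the guard "msg_header && 
is_from(yytext)" comes from the discussion above.  The sketch assumes the 
header flag has already been set by the low-level check on the raw line.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for is_from(): exact-case match on the five
 * bytes "From " at the start of the text. */
static bool is_from_exact(const char *text)
{
    return strncmp(text, "From ", 5) == 0;
}

/* Hypothetical guard modelled on "msg_header && is_from(yytext)".
 * A caseless lexer rule may fire on "fRoM ", and decoded body text may
 * start with "From "; both are rejected here. */
static bool is_separator(bool msg_header, const char *text)
{
    return msg_header && is_from_exact(text);
}
```

With this guard, "fRoM nick" is rejected because of its case, and a body 
line that merely decodes to "From ..." is rejected because msg_header is 
false at that point.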

>>Second, lexer.c uses is_from() on the raw input line.  This is needed so 
>>that bogofilter knows when to set its header state.  The text is put 
>>where the current flex code wants it, nowhere else.  Since all 3 lexers 
>>check for message separators, it doesn't matter which lexer buffer gets 
>>the line - the message separator will be recognized.
>>
>>Third, token.c's states (LEXER_HEAD, LEXER_TEXT, LEXER_HTML) determine 
>>which lexer is called.  Changing the state "in the middle of a call" 
>>doesn't affect the current call.  It affects the next call.
>
>Except in my code, I was using that to move the buffer.
>
>>By the way, there _is_ a flaw in the state-changing code.  Assume that 
>>the plain text lexer is running when a message separator is 
>>detected.  The state gets changed to LEXER_HEAD and the next call to 
>>get_token() uses the head lexer for the next line.  The remainder of the 
>>message separator is left with the plain text lexer until the plain text 
>>lexer is next used.  When this happens, bogofilter will include the 
>>leftover tokens in the wrong message.  This leads to an incorrect count 
>>for the token.
>
>Which is what I was trying to fix by dragging the buffer out of the head 
>lexer and installing it into the plain text lexer.  The buffer structure 
>should work.

As I found yesterday, both your buffer switching code and my unified lexer 
work properly.  Buffer switching adds complexity which I don't 
like.  However if it leads to a faster, better parser I'll learn to accept 
the complexity.  In time, I may come to love it :-)

>>>This should never have worked.  One way or the other we are ripping 
>>>something out from under something else.
>>
>>Sorry, there is no "ripping out from under" :-(
>
>The problem you discovered is essentially the one I have been working on 
>all along.  The whole effort to get -CF to work is, essentially, what is 
>driving the moving of buffers around.
>
>If you think that the current lexer should be allowed to finish parsing 
>the current line as opposed to the current token, that is doable.  Is that 
>what you actually mean?

Processing of the From line needs to be completed before processing begins 
on the next line.  In bogofilter 0.10.x, the lexer switch leaves the old 
lexer with a partially processed line.  Odds are that bogofilter will again 
use that old lexer - at which time processing of the partial line will 
complete.  This can result in a token belonging to one message being 
processed as part of a later message.  I believe I sent you the file 
(bradyn.mbx) that demonstrates the problem.
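
The effect can be reproduced with a toy model.  This is only a sketch under 
simplifying assumptions: struct toy_lexer and next_token() are hypothetical, 
each toy lexer holds the unread remainder of the one line it was given, and 
tokens are just space-separated words.  It is not the flex-generated code.

```c
#include <stddef.h>

/* Each toy lexer keeps its own unread input, like a per-lexer buffer. */
struct toy_lexer {
    const char *unread;  /* remainder of the line this lexer was given */
};

/* Copy the next space-delimited word into out; return 0 when none is left. */
static int next_token(struct toy_lexer *lx, char *out, size_t outlen)
{
    const char *p = lx->unread;
    size_t n = 0;

    if (p == NULL)
        return 0;
    while (*p == ' ')
        p++;
    if (*p == '\0') {
        lx->unread = NULL;
        return 0;
    }
    while (*p != '\0' && *p != ' ' && n + 1 < outlen)
        out[n++] = *p++;
    out[n] = '\0';
    lx->unread = p;
    return 1;
}

/* Count the separator-line tokens that surface only after the switch. */
static int leftover_tokens_after_early_switch(void)
{
    struct toy_lexer text = { "From nick Wed" };  /* separator line */
    struct toy_lexer head = { "Subject: hi" };    /* next message's header */
    char tok[32];
    int leftovers = 0;

    /* The text lexer returns "From"; the switch to the head lexer
     * happens immediately, mid-line. */
    next_token(&text, tok, sizeof tok);

    /* The head lexer processes the new message's header. */
    while (next_token(&head, tok, sizeof tok))
        ;

    /* When the text lexer is used again, it first drains what it
     * still holds from the old separator line. */
    while (next_token(&text, tok, sizeof tok))
        leftovers++;
    return leftovers;
}
```

In this model the words "nick" and "Wed" come out of the text lexer only 
after the head lexer has run, which is the misattribution described above.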

>>>I made the first crack at fixing this - I created a second buffer in 
>>>yyinput and had the input built there.  Then I save it and return it on 
>>>next read, forcing an EOF, if there has been a lexer change.  But that 
>>>is the wrong thing to do.  There are still things in the air, and there 
>>>are still things that will lace.
>>>
>>>Perhaps the simplest thing would be for the lexer swap to be deferred 
>>>until after the lexer returns.  In fact, I will try that next.  That has 
>>>the best chance of anything to work --- just postpone the lexer swap 
>>>until the lexer returns.
>>
>>With multiple lexers, the state change _should_ await the end of the line.
>
>Should it await the end of the line or simply the end of the token?

I'm thinking it should wait until the end of the line.  However that's 
tricky as it's necessary to deal with folded lines.  When I got to that bit 
of complexity yesterday, I said "S**** it" and released my unified 
lexer.  (I didn't remove the 3 part lexer from cvs, so it can be easily 
resurrected by changing the makefile).
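
For the record, here is roughly what "wait until the end of the line" could 
look like with folded lines taken into account.  It is a sketch, not code 
from cvs: struct switcher, request_switch(), is_continuation() and 
lexer_for_line() are hypothetical names, and it assumes a folded 
continuation line is one that starts with a space or a tab.  Only the 
LEXER_HEAD, LEXER_TEXT and LEXER_HTML state names are taken from token.c.

```c
#include <stdbool.h>

enum lexer_state { LEXER_HEAD, LEXER_TEXT, LEXER_HTML };

/* Hypothetical bookkeeping for a deferred lexer switch. */
struct switcher {
    enum lexer_state current;   /* lexer in use for the current line */
    enum lexer_state pending;   /* lexer requested for later */
    bool has_pending;
};

/* Record the request; the current line keeps its lexer. */
static void request_switch(struct switcher *s, enum lexer_state next)
{
    s->pending = next;
    s->has_pending = true;
}

/* Assumption: a folded continuation line begins with space or tab. */
static bool is_continuation(const char *line)
{
    return line[0] == ' ' || line[0] == '\t';
}

/* Call before handing next_line to a lexer.  The pending switch is
 * applied only once the logical (possibly folded) line has ended. */
static enum lexer_state lexer_for_line(struct switcher *s, const char *next_line)
{
    if (s->has_pending && !is_continuation(next_line)) {
        s->current = s->pending;
        s->has_pending = false;
    }
    return s->current;
}
```

The old lexer therefore sees every continuation of the line on which the 
switch was requested, and the new lexer starts on the first unfolded line.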

David