The joy of buffer switching....

Nick Simicich njs at scifi.squawk.com
Thu Feb 27 15:33:46 CET 2003


At 11:46 AM 2003-02-26 -0500, David Relson wrote:

>>Actually, I looked at that and I do not see it being called at 
>>all.  Maybe that is part of the issue as to why I am in b,b state and I 
>>do not expect them to be.
>
>Sounds like you've got something broken...

The fact that your output matches my output means that your bugs probably 
match my bugs.  Do t.lexer.mbx with -v and look at the state transitions.

>>>>4.  Well, we have been handed buffer A from, say, the plain text lexer, 
>>>>and that is the one we are putting stuff into. But -- whoops...we have 
>>>>just discovered, *by putting the data into the buffer* that we will be 
>>>>switching lexers.  With the data partially in the buffer, but with the 
>>>>count unreported to Flex, we tell Flex that we want to switch 
>>>>buffers...whoops!  That was what/why I saw the processing overlaying 
>>>>part of the data in the buffer.
>>>
>>>Sorry, but flex isn't told anything.  The state is known only in token.c 
>>>and isn't used until the _next_ call to get_token().  The incoming data 
>>>goes into the buffer provided by flex and is returned to flex which 
>>>parses a token and returns to get_token().  When get_token() is next 
>>>called by bogofilter, the new lexer is used.
>>
>>I am really tired, I did something else last night.  Part of the issue 
>>has to do with the way I was swapping buffers and when.  I understand the 
>>deferred state better now.
>
>I understand.  I sometimes get really fatigued when something's not 
>working and I don't understand why and the evidence seems contradictory...
>
>>>The flaw in this sequence is that the remainder of the message separator 
>>>isn't parsed until bogofilter again uses the lexer that was active 
>>>when the message separator was detected.
>>>
>>>I'll modify the state changing code so that the change happens at the 
>>>end of the line, which should correct _that_ problem with lexer 
>>>switching.  Of course, I question the value of doing that since the 
>>>unified lexer doesn't have the problem.
>>>
>>>>5.  Now we return.  Holy cow are things hosed - we made a call with one 
>>>>buffer and got the return with a *different* buffer active.  We have 
>>>>some stuff that is in our stack, so we might make more updates to that 
>>>>buffer we have stolen.
>>>
>>>You may have discovered something relevant, I'm not sure.  However, you 
>>>misunderstand some of the intricacies of what's happening.
>>>
>>>First, there's the issue of "^From " detection.  Since the lexer is 
>>>caseless, its rule also recognizes "^fRoM ".  The extra checks, i.e. 
>>>"msg_header && is_from(yytext)", in the lexers are there for two 
>>>reasons:  (1) to verify proper case, and (2) that we're in a message 
>>>header.  We've seen QP encoded text where a line begins "=46rom" which 
>>>decodes to "From".  The special check keeps us from interpreting this as 
>>>a message separator.
>>
>>Again, I do not see the From in, say, the Plain Text parser, as being 
>>relevant.  The From is always detected from the input routine.  If it is 
>>detected again in the parser, I saw it as a problem.
>
>Ideally, neither the plain text nor the html parser should know about 
>From.  However when they're in control, there needs to be a way to say 
>"return to header mode", which is a task that the low-level check for From 
>helps with.  However, by the time the low-level code sees the From, it's 
>too late to say "I should be using the header lexer on this line".  Of 
>course with a single lexer, this problem goes away, because no 
>buffer/lexer switching is needed.

The point is that the right thing to do, in my opinion, is to just return a 
zero length and let EOF processing happen.  When EOF is done, you can just, 
I believe, call the lexer again.  It resets the buffer to ask for more 
input -- it only forces things through EOF once.
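
A minimal sketch of the "return a zero length once" idea, written as a 
stand-alone C reader rather than as real bogofilter or flex code.  The 
names reader_t, reader_request_eof() and reader_fill() are hypothetical; 
the only thing borrowed from flex is the input contract that a read 
returning 0 bytes means end of input.  It does not model flex's own 
end-of-file handling, only the input side.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the reader behind the lexer's input hook.
 * Contract assumed here: return the number of bytes placed in buf,
 * or 0 to signal end of input. */
typedef struct {
    const char **lines;   /* NULL-terminated list of input lines */
    size_t next;          /* index of the next line to hand out */
    int force_eof;        /* set when a lexer switch is pending */
} reader_t;

/* Ask for a single zero-length read, e.g. after a state change. */
static void reader_request_eof(reader_t *r)
{
    r->force_eof = 1;
}

/* Copy the next line into buf; return 0 exactly once if an EOF was requested. */
static size_t reader_fill(reader_t *r, char *buf, size_t max)
{
    assert(buf != NULL && max > 0);
    if (r->force_eof) {
        r->force_eof = 0;       /* one-shot: the next call reads normally */
        return 0;
    }
    if (r->lines[r->next] == NULL)
        return 0;               /* genuine end of input */
    size_t len = strlen(r->lines[r->next]);
    if (len > max)
        len = max;              /* truncates for brevity; a real reader would keep the rest */
    memcpy(buf, r->lines[r->next], len);
    r->next++;
    return len;
}
```

The forced EOF consumes no input: the call after it returns the line 
that would have been read anyway, which is the "only forces things 
through EOF once" behaviour described above.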

>>>Second, lexer.c uses is_from() on the raw input line.  This is needed so 
>>>that bogofilter knows when to set its header state.  The text is put 
>>>where the current flex code wants it, nowhere else.  Since all 3 lexers 
>>>check for message separators, it doesn't matter which lexer buffer gets 
>>>the line - the message separator will be recognized.
>>>
>>>Third, token.c's states (LEXER_HEAD, LEXER_TEXT, LEXER_HTML) determine 
>>>which lexer is called.  Changing the state "in the middle of a call" 
>>>doesn't affect the current call.  It affects the next call.
>>
>>Except in my code, I was using that to move the buffer.
>>
>>>By the way, there _is_ a flaw in the state-changing code.  Assume that 
>>>the plain text lexer is running when a message separator is 
>>>detected.  The state gets changed to LEXER_HEAD and the next call to 
>>>get_token() uses the head lexer for the next line.  The remainder of the 
>>>message separator is left with the plain text lexer until the plain text 
>>>lexer is next used.  When this happens, bogofilter will include the 
>>>leftover tokens in the wrong message.  This leads to an incorrect count 
>>>for the token.
>>
>>Which is what I was trying to fix by dragging the buffer out of the head 
>>lexer and installing it into the plain text lexer.  The buffer structure 
>>should work.
>
>As I found yesterday, both your buffer switching code and my unified lexer 
>work properly.  Buffer switching adds complexity which I don't 
>like.  However if it leads to a faster, better parser I'll learn to accept 
>the complexity.  In time, I may come to love it :-)
>
>>>>This should never have worked.  One way or the other we are ripping 
>>>>something out from under something else.
>>>
>>>Sorry, there is no "ripping out from under" :-(
>>
>>The problem you discovered is essentially the one I have been working on 
>>all along.  Trying to get -CF to work is essentially what is driving the 
>>moving of buffers around.
>>
>>If you think that the current lexer should be allowed to finish parsing 
>>the current line as opposed to the current token, that is doable.  Is 
>>that what you actually mean?
>
>Processing of the From line needs to be completed before processing begins 
>on the next line.  In bogofilter 0.10.x, the lexer switch leaves the old 
>lexer with a partially processed line.  Odds are that bogofilter will 
>again use that old lexer - at which time processing of the partial line 
>will complete.  This can result in a token belonging to one message being 
>processed as part of a later message.  I believe I sent you the file 
>(bradyn.mbx) that demonstrates the problem.

Well, there is a call that says, "Please flush the buffer (ignore any 
additional input)".

Again, forcing an EOF is the right thing to do.  I tested that code (force 
an EOF after the state change, and defer passing down the line that causes 
the state change) in combination with some other stuff.  The bugs I had 
were not related to that processing.
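
The deferral in that parenthetical can be sketched as follows.  This is 
an illustration under assumptions, not the code that was tested: 
feeder_t, feeder_next() and looks_like_from() are hypothetical names, the 
separator test is far simpler than the real is_from(), and only the 
LEXER_HEAD and LEXER_TEXT states from the thread are modelled.  The 
switch back from header to text is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { LEXER_HEAD, LEXER_TEXT } lexer_state_t;  /* state names from the thread */

typedef struct {
    const char **lines;     /* NULL-terminated raw input lines */
    size_t next;            /* next unread line */
    const char *held;       /* line held back across a lexer switch */
    lexer_state_t state;    /* lexer that should receive the next line */
} feeder_t;

/* Simplified separator test; the real is_from() checks more than this. */
static int looks_like_from(const char *line)
{
    return strncmp(line, "From ", 5) == 0;
}

/* Return the next line for the active lexer, or NULL for EOF.
 * A separator seen while the text lexer is active is not delivered:
 * it is held, the text lexer sees EOF, and the state moves to LEXER_HEAD. */
static const char *feeder_next(feeder_t *f)
{
    if (f->held != NULL) {              /* deliver the deferred line first */
        const char *line = f->held;
        f->held = NULL;
        return line;
    }
    const char *line = f->lines[f->next];
    if (line == NULL)
        return NULL;                    /* genuine end of input */
    f->next++;
    if (f->state == LEXER_TEXT && looks_like_from(line)) {
        f->held = line;                 /* defer the triggering line */
        f->state = LEXER_HEAD;          /* the header lexer gets the next read */
        return NULL;                    /* forced EOF for the text lexer */
    }
    return line;
}
```

Because the text lexer never receives any part of the separator line, it 
has no leftover tokens to emit when it is next used; the whole line goes 
to whichever lexer is selected after the switch.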

>>>>I made the first crack at fixing this - I created a second buffer in 
>>>>yyinput and had the input built there.  Then I save it and return it 
>>>>on next read, forcing an EOF, if there has been a lexer change.  But 
>>>>that is the wrong thing to do.  There are still things in the air, and 
>>>>there are still things that will lace.
>>>>
>>>>Perhaps the simplest thing would be for the lexer swap to be deferred 
>>>>until after the lexer returns.  In fact, I will try that next.  That 
>>>>has the best chance of anything to work --- just postpone the lexer 
>>>>swap until the lexer returns.
>>>
>>>With multiple lexers, the state change _should_ await the end of the line.
>>
>>Should it await the end of the line or simply the end of the token?
>
>I'm thinking it should wait until the end of the line.  However that's 
>tricky as it's necessary to deal with folded lines.  When I got to that 
>bit of complexity yesterday, I said "S**** it" and released my unified 
>lexer.  (I didn't remove the 3 part lexer from cvs, so it can be easily 
>resurrected by changing the makefile).

I am not sure why it is a good idea to process the From line with the plain 
text processor.  I thought about this, and think the right thing to do is 
to pass in an EOF and copy the buffer.

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!



More information about the bogofilter-dev mailing list