The joy of buffer switching....

Wed Feb 26 17:06:44 CET 2003

At 08:46 AM 2003-02-25 -0500, David Relson wrote:

>>3.  At that point, outside of the lexer, we started looking at the 
>>input.  One example is that we looked to see if there was a "From " in 
>>the input.  That was not first done in the lexer; as far as I could tell, 
>>that match is simply never made there, so it is a waste of time.  (I think 
>>it was left there to confuse me).
>
>As described elsewhere, the check for "From " is _necessary_ and I have 
>test cases to demonstrate the need for it.  FWIW, the match _is_ 
>made.  Try setting a breakpoint at is_from() and you'll see that it gets 
>called twice for each message separator - once from lgetsl() in lexer.c 
>and once from the "^From " pattern of the current lexer.

Actually, I looked at that and I do not see it being called at all.  Maybe 
that is part of the issue as to why I am in b,b state and I do not expect 
them to be.

>>4.  Well, we have been handed buffer A from, say, the plain text lexer, 
>>and that is the one we are putting stuff into. But -- whoops...we have 
>>just discovered, *by putting the data into the buffer* that we will be 
>>switching lexers.  With the data partially in the buffer, but with the 
>>count unreported to Flex, we tell Flex that we want to switch 
>>buffers...whoops!  That was what/why I saw the processing overlaying part 
>>of the data in the buffer.
>
>Sorry, but flex isn't told anything.  The state is known only in token.c 
>and isn't used until the _next_ call to get_token().  The incoming data 
>goes into the buffer provided by flex and is returned to flex which parses 
>a token and returns to get_token().  When get_token() is next called by 
>bogofilter, the new lexer is used.

I am really tired; I did something else last night.  Part of the issue has 
to do with the way I was swapping buffers and when.  I understand the 
deferred state better now.

>The flaw in this sequence is that the remainder of the message separator 
>isn't parsed until bogofilter starts using the lexer that was active when 
>the message separator was detected.
>
>I'll modify the state changing code so that the change happens at the end 
>of the line, which should correct _that_ problem with lexer switching.  Of 
>course, I question the value of doing that since the unified lexer doesn't 
>have the problem.
>
>>5.  Now we return.  Holy cow are things hosed - we made a call with one 
>>buffer and got the return with a *different* buffer active.  We have some 
>>stuff that is in our stack, so we might make more updates to that buffer 
>>we have stolen.
>
>You may have discovered something relevant, I'm not sure.  However, you 
>misunderstand some of the intricacies of what's happening.
>
>First, there's the issue of "^From " detection.  Since the lexer is 
>caseless, its rule also recognizes "^fRoM ".  The extra checks, i.e. 
>"msg_header && is_from(yytext)", in the lexers are there for two 
>reasons:  (1) to verify proper case, and (2) that we're in a message 
>header.  We've seen QP encoded text where a line begins "=46rom" which 
>decodes to "From".  The special check keeps us from interpreting this as a 
>message separator.

Again, I do not see the From in, say, the Plain Text parser, as being 
relevant.  The From is always detected from the input routine.  If it is 
detected again in the parser, I saw it as a problem.
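
To make the two-part check concrete, here is a minimal sketch of what 
"msg_header && is_from(yytext)" guards against.  The names is_from_sketch() 
and is_separator_sketch() are hypothetical stand-ins, not bogofilter's 
code, and the real is_from() may well check more than the leading five 
characters.  The only point is that a caseless "^From " rule needs a 
case-sensitive recheck plus the header condition.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for is_from(): a case-sensitive test that the
 * text begins with exactly "From " (capital F, trailing space). */
static bool is_from_sketch(const char *text)
{
    return strncmp(text, "From ", 5) == 0;
}

/* Hypothetical combined check, mirroring "msg_header && is_from(yytext)":
 * the caseless lexer rule has already matched, so this rejects matches
 * with the wrong case or seen while msg_header is false. */
static bool is_separator_sketch(bool msg_header, const char *text)
{
    return msg_header && is_from_sketch(text);
}
```

With this, "fRoM someone" is rejected on case, and a "From " line seen 
while msg_header is false is rejected on state.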

>Second, lexer.c uses is_from() on the raw input line.  This is needed so 
>that bogofilter knows when to set its header state.  The text is put where 
>the current flex code wants it, nowhere else.  Since all 3 lexers check 
>for message separators, it doesn't matter which lexer buffer gets the line 
>- the message separator will be recognized.
>
>Third, token.c's states (LEXER_HEAD, LEXER_TEXT, LEXER_HTML) determine 
>which lexer is called.  Changing the state "in the middle of a call" 
>doesn't affect the current call.  It affects the next call.

Except in my code, I was using that to move the buffer.
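
For reference, the deferred behaviour David describes can be sketched 
roughly as below.  Only the state names (LEXER_HEAD, LEXER_TEXT, 
LEXER_HTML) come from this thread; get_token_sketch(), the three stand-in 
lexers, and the calls[] counter are assumptions made for illustration and 
are not the code in token.c.

```c
#include <stddef.h>

/* State names are from the thread; everything else is hypothetical. */
typedef enum { LEXER_HEAD, LEXER_TEXT, LEXER_HTML } lexer_state_t;

typedef int (*lexer_fn)(void);

static lexer_state_t lexer_state = LEXER_HEAD;
static int calls[3];            /* how often each stand-in lexer ran */

/* Stand-in lexers.  The text lexer pretends it has just seen a message
 * separator and requests the head lexer while it is still running. */
static int head_lex(void) { calls[LEXER_HEAD]++; return 1; }
static int text_lex(void) { calls[LEXER_TEXT]++; lexer_state = LEXER_HEAD; return 2; }
static int html_lex(void) { calls[LEXER_HTML]++; return 3; }

static int get_token_sketch(void)
{
    static const lexer_fn table[] = { head_lex, text_lex, html_lex };
    lexer_fn lex = table[lexer_state];  /* lexer is chosen once, on entry */
    return lex();   /* a state change made inside lex() is only seen by
                       the next call to get_token_sketch() */
}
```

Starting in LEXER_TEXT, the first call still runs the text lexer to 
completion even though the state flips during it; the head lexer only 
runs on the following call.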

>By the way, there _is_ a flaw in the state-changing code.  Assume that the 
>plain text lexer is running when a message separator is detected.  The 
>state gets changed to LEXER_HEAD and the next call to get_token() uses the 
>head lexer for the next line.  The remainder of the message separator is 
>left with the plain text lexer until the plain text lexer is next 
>used.  When this happens, bogofilter will include the leftover tokens in 
>the wrong message.  This leads to an incorrect count for the token.

Which is what I was trying to fix by dragging the buffer out of the head 
lexer and installing it into the plain text lexer.  The buffer structure 
should work.

>>This should never have worked.  One way or the other we are ripping 
>>something out from under something else.
>
>Sorry, there is no "ripping out from under" :-(

The problem you discovered is essentially the one I have been working on 
all along.  Trying to get -CF to work is, essentially, what is driving the 
moving of buffers around.

If you think that the current lexer should be allowed to finish parsing the 
current line as opposed to the current token, that is doable.  Is that what 
you actually mean?

>>I made the first crack at fixing this - I created a second buffer in 
>>yyinput and had the input built there.  Then I save it and return it on 
>>next read, forcing an EOF, if there has been a lexer change.  But that is 
>>the wrong thing to do.  There are still things in the air, and there are 
>>still things that will lace.
>>
>>Perhaps the simplest thing would be for the lexer swap to be deferred 
>>until after the lexer returns.  In fact, I will try that next.  That has 
>>the best chance of anything to work --- just postpone the lexer swap 
>>until the lexer returns.
>
>With multiple lexers, the state change _should_ await the end of the line.

Should it await the end of the line or simply the end of the token?
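
To pin down the difference between the two options, here is a rough 
sketch of the end-of-line variant.  The names active, pending, 
request_lexer() and token_done() are hypothetical and do not exist in 
bogofilter; only the state names come from this thread.

```c
#include <stdbool.h>

/* State names are from the thread; the rest is hypothetical. */
typedef enum { LEXER_HEAD, LEXER_TEXT, LEXER_HTML } lexer_state_t;

static lexer_state_t active  = LEXER_TEXT;  /* lexer the next call uses */
static lexer_state_t pending = LEXER_TEXT;  /* requested, not yet applied */

/* Record a requested switch without applying it. */
static void request_lexer(lexer_state_t next)
{
    pending = next;
}

/* Call after each token is returned.  at_eol is true when that token
 * ended the current line; only then does the pending switch apply. */
static void token_done(bool at_eol)
{
    if (at_eol)
        active = pending;
}
```

Dropping the at_eol test, so that the pending state is applied after 
every token, would give the end-of-token variant instead.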

>   The problem goes away with a single lexer as there are no states to 
> worry about.
>
>Regarding performance, I have no objection to your continuing work on 
>buffer switching.  Performance improvements are always welcome.  The 
>current trio of lexers is, I now believe, _not_ the right division of 
>labor.  The fact that they duplicate some patterns is a bad 
>thing.  Yesterday I fixed a problem with the BOUNDARY pattern, which (at 
>some point) had been corrected in lexer_head.l, but not in 
>lexer_text_plain.l or in lexer_text_html.l.  Keeping them all synchronized 
>is something of a problem.
>
>Anyhow, to get back on topic, when both lexers are working we can run some 
>performance tests.
>
>David

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!