The joy of buffer switching....

Nick Simicich njs at scifi.squawk.com
Tue Feb 25 06:23:28 CET 2003


At 11:02 AM 2003-02-24 -0500, David Relson wrote:

>As I read the documentation on flex's support for multiple buffers, I see 
>something different.  Flex works with the first buffer, then with the 
>second, then returns to the first _at_the_same_place_ as before.  For 
>bogofilter, we want to process the buffer starting at a _different_ place.

No, we are resuming the buffer at the same place at which it was 
suspended.  This is supposed to be something that works.  I see nothing 
anywhere that says "it is evil to move buffers around"; on the contrary, 
the buffer structures are exactly the same (I have checked), and I simply 
do not see the difference between suspending the buffer with 
yy_switch_to_buffer and what the documented examples do... hmmm.
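
For what it is worth, the documented pattern looks like this (a minimal 
sketch; the files and the yylex() driving are placeholders, but the calls 
are flex's standard buffer API):

    YY_BUFFER_STATE a = yy_create_buffer(file_a, YY_BUF_SIZE);
    YY_BUFFER_STATE b = yy_create_buffer(file_b, YY_BUF_SIZE);

    yy_switch_to_buffer(a);
    yylex();                 /* tokens from a ...                    */
    yy_switch_to_buffer(b);  /* suspend a; its read position is kept */
    yylex();                 /* tokens from b ...                    */
    yy_switch_to_buffer(a);  /* resume a exactly where it stopped    */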

>Given that we want to use one buffer (for batch efficiency), flex limits 
>us to one parser.

No.  We need to code without errors.  There is a fundamental error in the 
way that the programs are coded, which I finally found...I think.

>Of course, the current bogofilter has 3 lexers and 3 buffers.  The big 
>difference is that we are using flex in a line by line mode (interactive) 
>rather than a batch mode.

No, it is, as I suspected, an accident that it ever worked.

>There are at least two reasons to convert back to a single lexer.  First, 
>we want better head/body control in our parsing.  Using flex start 
>conditions, i.e. states, that can be done.  Second, we want the improved 
>efficiency of batch mode.
>
>I'm working on a single lexer _with_ states.  With luck I'll have it 
>working in a day or two.  Then we can get back to the issue of performance.

Well, the problem is that we can't get the absolute best performance with a 
combined parser because of "backing up".  The entire parser has to be free 
of "backing up states" before you can eliminate the backing-up code and 
checks from the parser.  Please let me have one more whack at forcing in 
EOFs so that the separate parsers have a chance.

Unless you are sure about how to eliminate backing up with the revised HTML 
parser, I believe that the separate parsers will perform better....
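
For illustration, flex's -b option writes a lex.backup file naming the 
states that back up.  In a toy ruleset like the one below (hypothetical, 
not bogofilter's grammar), input such as "foobaz" forces the scanner to 
back up from "fooba" to the shorter "foo" match; explicitly matching the 
dangling prefixes, as the flex manual suggests, removes the backing-up 
states:

    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;
    fooba       |
    foob        |
    fo          { /* false alarm, not really a keyword */ }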

>>Doing the yy_switch_to_buffer() is supposed to take the variables that 
>>are up in the air inside the lexer and stash them inside the buffer's 
>>state variables. But it just is not working.  I spent a while tracing 
>>this, watching it go wrong, until I realized that the yy_switch_to_buffer 
>>was hosing the buffer.
>
>As a thought, if indeed the yy_switch_to_buffer() code (or other C code) 
>is defective, there's a chance we could correct it and submit a patch to 
>flex.  Given that thought, I poked around to find the origins of that 
>code.  Running "strings /usr/bin/flex" it appears that the code is built 
>into the executable.  If we developed a patch to fix buffer switching, the 
>patch would have to be applied after flex converts the .l file to a .c 
>file.  This approach would be further complicated by supporting different 
>versions of flex (and/or lex) on different versions of linux and of the 
>other operating systems on which bogofilter runs.  I don't think this 
>approach would be successful.

The way that flex works is that there are essentially two parts.  The 
semi-fixed part is the generated driver code; the variable part is the 
large tables in which each character is looked up, yielding either an 
action or another state to look up next.  This is why matching stays 
essentially linear no matter how many patterns there are: the more patterns, 
the more work goes into building the tables, but once that is done, a good 
implementation does a constant amount of table lookup per input character, 
i.e., runs in time linear in the input.  I have a program which matches 
a complex regular expression against the mail log in real time, searching 
for mail sent to spamtraps, so that the IP addresses of the transmitting 
sites can be added to a blocklist.  (This is maintained as 
blocklist4.squawk.com, and it regularly contains between 700 and 1000 
different sites that have tried to mail a spamtrap address at squawk.com 
within the last week.  A total of 9,121 addresses have ever been in this 
blocklist since I started keeping stats; that is the size of the table 
keyed by IP address.)
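
Conceptually, the inner loop of the generated scanner is just the 
following (a sketch with made-up names - yy_next, yy_accept, do_action - 
not flex's actual compressed-table symbols):

    int state = START_STATE;
    int c;
    while ((c = getc(yyin)) != EOF) {
        state = yy_next[state][(unsigned char)c]; /* one lookup per char */
        if (yy_accept[state])                     /* accepting state?    */
            do_action(yy_accept[state]);          /* dispatch the rule   */
    }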

The error is in the semi-fixed part of flex.  Yep, if I could figure out 
what is wrong with the code, or anyone could, then it might be possible to 
use awk, or even patch, to put in the repaired code.

The reason one uses flex is essentially to build the state machine offline, 
and to build it as efficiently as possible.  But theoretically it should 
be possible to do all of this dynamically: with a good enough 
implementation of regular expression compilation, as long as you can 
reduce your matching to a *single* regular expression, no matter how 
complex, and you have a good way of dispatching actions, there is no reason 
not to use regular expressions directly.  The "boost" regular expression 
library is probably nearly as good as flex.

(Hmmm... in demime, I match regular expressions sequentially.  I have groups 
of regular expressions that are applied one after another, and I should 
probably combine them.  By putting each expression in a non-grouping 
parenthesis and "or"ing them together, I should be able to apply them all 
at once.  I should determine whether it is actually faster that way.)
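
In C terms the idea looks like this (a sketch with POSIX regcomp and 
made-up patterns; in Perl the groups would be the non-capturing (?:...) 
kind):

    #include <regex.h>

    /* One compiled expression instead of N sequential matches;
     * the alternatives here are placeholders for illustration. */
    static int matches_any(const char *line)
    {
        regex_t re;
        int hit;

        regcomp(&re, "(pattern1)|(pattern2)|(pattern3)",
                REG_EXTENDED | REG_NOSUB);
        hit = (regexec(&re, line, 0, NULL, 0) == 0);
        regfree(&re);
        return hit;
    }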

>>I have two approaches.  One is that I should be calling 
>>yy_switch_to_buffer from within the rule rather than from outside.  I 
>>will try adding the code to the processing of "From".
>>
>>If that fails, then I will work on forcing in EOFs and moving detection 
>>of From, mime boundaries and header ends to the code outside of the lexer.
>>
>>Feeding the lexer artificial EOFs at the end of every section is probably 
>>clean enough to work unconditionally.
>
>Artificial EOF's sound reasonable and I suspect they would work.
>
>Returning to the master/slave design, another approach would be to have 
>the master separate out each mime body part (with decoding) and pass that 
>(as a buffer) to the slave.  It would use additional memory, but it should 
>be much simpler than all the complicated buffer contortions you're 
>encountering.

I believe this is just another way of saying "artificial EOF".
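
A sketch of how the artificial EOF could look, assuming a hypothetical 
section_ended() test and refill() reader (YY_INPUT is flex's standard 
input hook; when yywrap() then returns 0, flex keeps scanning with 
whatever input has been set up next):

    #define YY_INPUT(buf, result, max_size)              \
        do {                                             \
            if (section_ended())                         \
                result = 0;  /* report a fake EOF */     \
            else                                         \
                result = refill((buf), (max_size));      \
        } while (0)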

I guess my point is that it *should* be possible to do either.  What we 
should do is this:

You should work on combining all the buffers into one with states.

I should work on artificial EOFs, and moving the part detection up and out.

We should try them out against each other.  I reserve the right to be 
wrong... but the trials should happen with the non-backing-up plain text 
parser.

We should work up a performance suite.  It should be representative of the 
actual things that people do with bogofilter.  I believe that mailbox 
processing should be only a small part of it - most of it should be 
parsing single messages, including large single messages with lots of 
sections, but they should be representative of the actual messages that 
people are getting.  I would be willing to, say, start with a fixed 
database, take 10-15 representative messages, parse them 100 times each, 
and use the total elapsed and user time as an indicator.
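
A rough driver for that could look like this (a sketch: it assumes 
bogofilter is on the PATH, the message file names are placeholders, and 
times() is used to collect the children's user time):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/times.h>
    #include <unistd.h>

    int main(void)
    {
        /* placeholder names; use 10-15 representative messages */
        const char *msgs[] = { "msg01.eml", "msg02.eml", "msg03.eml" };
        int nmsgs = sizeof msgs / sizeof msgs[0];
        struct tms t0, t1;
        clock_t r0 = times(&t0), r1;
        long hz = sysconf(_SC_CLK_TCK);

        for (int i = 0; i < nmsgs; i++)
            for (int rep = 0; rep < 100; rep++) {
                char cmd[512];
                snprintf(cmd, sizeof cmd, "bogofilter < %s", msgs[i]);
                system(cmd);     /* classification result ignored */
            }

        r1 = times(&t1);
        printf("elapsed %.2fs, user %.2fs\n",
               (double)(r1 - r0) / hz,
               (double)(t1.tms_cutime - t0.tms_cutime) / hz);
        return 0;
    }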

Once this is done, different approaches, parameters, and so forth can be 
compared.  The numbers I am throwing around may well be more representative 
of a pathological case than of actual processing.

-=-=-=-=-=-=-=-=-
After writing the stuff above (other than the allusions to "it was an 
accident it ever worked"), I finally figured out the bug.

This is the problem, as I see it:

1.  yyinput was being called from within flex.  There was a bunch of 
scanner state up in the air when it was called.  Flex called it because 
some buffer needed more input, and it handed us that buffer (actually an 
offset into that buffer).

2.  We started filling that buffer - actually altering it - and it was 
passed from routine to routine along the way.

3.  At that point, outside of the lexer, we started looking at the 
input.  One example is that we looked to see if there was a "From " in the 
input.  That match was never actually being made in the lexer, as it first 
seemed; the rule there simply never fires and is a waste of time.  (I 
think it was left there to confuse me.)

4.  Well, we have been handed buffer A from, say, the plain text lexer, and 
that is the one we are putting stuff into.  But - whoops - we have just 
discovered, *by putting the data into the buffer*, that we will be 
switching lexers.  With the data partially in the buffer, but the count 
unreported to flex, we tell flex that we want to switch buffers... 
whoops!  That is why I saw the processing overlaying part of the data in 
the buffer.

5.  Now we return.  Holy cow, are things hosed - we made a call with one 
buffer and got the return with a *different* buffer active.  We still have 
some state on our stack, so we might make more updates to that buffer we 
have stolen.

This should never have worked.  One way or the other we are ripping 
something out from under something else.
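
In pseudo-macro form, the broken sequence is something like this 
(read_more(), new_part_seen(), and other_buffer are stand-ins for the real 
bogofilter code, not its actual names):

    #define YY_INPUT(buf, result, max_size)                         \
        do {                                                        \
            result = read_more((buf), (max_size)); /* filling A */  \
            if (new_part_seen(buf)) {                               \
                /* BUG: flex still holds pointers into the          \
                 * half-filled buffer A, and the new byte count     \
                 * has not been reported to it yet. */              \
                yy_switch_to_buffer(other_buffer);                  \
            }                                                       \
        } while (0)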

I made the first crack at fixing this - I created a second buffer in 
yyinput and had the input built there.  Then I saved it and returned it on 
the next read, forcing an EOF if there had been a lexer change.  But that 
is the wrong thing to do.  There are still things in the air, and there 
are still things that will break.

Perhaps the simplest thing would be for the lexer swap to be deferred until 
after the lexer returns.  In fact, I will try that next.  That has the best 
chance of anything of working - just postpone the lexer swap until the 
lexer returns.
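
A sketch of the deferred swap, with a hypothetical pending flag (yyinput 
would also report an EOF so that the current yylex() call returns 
promptly):

    static YY_BUFFER_STATE pending = NULL;

    void request_lexer_swap(YY_BUFFER_STATE b)  /* called from yyinput */
    {
        pending = b;              /* defer - do NOT switch here */
    }

    int next_token(void)
    {
        if (pending != NULL) {    /* safe point: the lexer has returned */
            yy_switch_to_buffer(pending);
            pending = NULL;
        }
        return yylex();
    }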

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


