lexer differences [was: The joy of buffer switching....]

Thu Feb 27 19:45:52 CET 2003

Hi Nick,

At 09:33 AM 2/27/03, Nick Simicich wrote:
>At 11:46 AM 2003-02-26 -0500, David Relson wrote:
>
>>>Actually, I looked at that and I do not see it being called at 
>>>all.  Maybe that is part of the issue as to why I am in b,b state and I 
>>>do not expect them to be.
>>
>>Sounds like you've got something broken...
>
>The fact that your output matches my output means that your bugs probably 
>match my bugs.  Do t.lexer.mbx with -v and look at the state transitions.

The output of lexer_v3 doesn't match the output of the head/text/html lexer 
trio.  It's close but slightly different.  Having examined the differences, 
I think that they're caused by lexer_v3 doing a couple of things right that 
the trio didn't do properly.  I deem the new output to be "good".

Your output matches lexer_v3, except for the two differences in handling of 
non-breaking html tags.  That's a good thing.  I'm just looking for input 
on whether the differences are a good thing or a bad thing.  We'll change 
the code according to the good/bad judgements.

Are there bugs in the lexer?  Very possibly there are, but I haven't yet 
found them and nobody else has either.

FWIW, I'm referring to tokens output by the lexer in the above 
comments.  The lexer trio wasn't quite right with the head/body flags 
(states), though I think the unified lexer is correct.  Certainly I see 
is_from() and got_from() being called and transitions from body to head 
mode when From is encountered.  Your statement that got_from() is never 
called is very puzzling.
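
To make the body-to-head transition concrete, here is a minimal sketch in C. It assumes the usual mbox convention that a line starting with "From " begins a new message and that a blank line ends the header. The names lex_mode_t, looks_like_from() and next_mode() are hypothetical and are not bogofilter's actual is_from()/got_from() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lexer modes; not bogofilter's actual definitions. */
typedef enum { MODE_HEAD, MODE_BODY } lex_mode_t;

/* Returns true when the line starts with the mbox separator "From ". */
bool looks_like_from(const char *line)
{
    assert(line != NULL);
    return strncmp(line, "From ", 5) == 0;
}

/* Returns the mode that applies once `line` has been seen. */
lex_mode_t next_mode(lex_mode_t mode, const char *line)
{
    if (looks_like_from(line))
        return MODE_HEAD;   /* new message: back to header mode */
    if (mode == MODE_HEAD && (line[0] == '\n' || line[0] == '\0'))
        return MODE_BODY;   /* blank line ends the header */
    return mode;            /* anything else leaves the mode unchanged */
}
```

With a check like this done per line, a "From " line seen in body mode yields MODE_HEAD before any tokens from that line are produced, which is the behaviour described above for the unified lexer.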

... [snip] ...

>>Ideally, neither the plain text nor the html parser should know about 
>>From.  However when they're in control, there needs to be a way to say 
>>"return to header mode", which is a task that the low-level check for 
>>From helps with.  However, by the time the low-level code sees the From, 
>>it's too late to say "I should be using the header lexer on this 
>>line".  Of course with a single lexer, this problem goes away, because no 
>>buffer/lexer switching is needed.
>
>The point is that the right thing to do, in my opinion, is to just return 
>a zero length and let EOF processing happen.  When EOF is done, you can 
>just, I believe, call the lexer again.  It resets the buffer to ask for 
>more input -- it only forces things through EOF once.

As I've said before, we're smarter than the lexer and if we need to tell it 
there's an EOF, we should do that.
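
For reference, the "force an EOF and defer the line" idea can be sketched as below. This is only an illustration under stated assumptions: reader_t, reader_get() and reader_at_end() are hypothetical names, the input is an in-memory array of lines, and reader_get() is a stand-in for whatever input hook feeds the lexer. It is not code from bogofilter or from Nick's tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input source: an in-memory array of lines. */
typedef struct {
    const char **lines;   /* NULL-terminated array of input lines */
    size_t next;          /* index of the next unread line */
    const char *held;     /* line deferred across a forced EOF, or NULL */
    bool in_body;         /* true once the blank line after the header is seen */
} reader_t;

/* True only at the real end of input (nothing held, nothing left). */
bool reader_at_end(const reader_t *r)
{
    assert(r != NULL);
    return r->held == NULL && r->lines[r->next] == NULL;
}

/* Copy the next line into buf and return its length.
 * Returns 0 at the real end of input and also, once, just before a
 * "From " line seen in the body. The caller can then run its EOF
 * processing and call again to receive the deferred line. */
size_t reader_get(reader_t *r, char *buf, size_t size)
{
    const char *line;

    assert(r != NULL && buf != NULL && size > 0);

    if (r->held != NULL) {
        line = r->held;              /* deliver the deferred From line */
        r->held = NULL;
        r->in_body = false;          /* a new header starts here */
    } else {
        line = r->lines[r->next];
        if (line == NULL)
            return 0;                /* real end of input */
        r->next++;
        if (r->in_body && strncmp(line, "From ", 5) == 0) {
            r->held = line;          /* defer it and force one EOF */
            return 0;
        }
        if (!r->in_body && line[0] == '\n')
            r->in_body = true;       /* blank line ends the header */
    }

    size_t len = strlen(line);
    if (len >= size)
        len = size - 1;              /* truncate to fit the buffer */
    memcpy(buf, line, len);
    buf[len] = '\0';
    return len;
}
```

The point of the sketch is that the line that triggers the state change is never handed to the body lexer: the caller sees a zero length, finishes the current message, and gets the From line on the next call. reader_at_end() lets the caller tell a forced EOF from the real one.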

>>Processing of the From line needs to be completed before processing 
>>begins on the next line.  In bogofilter 0.10.x, the lexer switch leaves 
>>the old lexer with a partially processed line.  Odds are that bogofilter 
>>will again use that old lexer - at which time processing of the partial 
>>line will complete.  This can result in a token belonging to one message 
>>being processed as part of a later message.  I believe I sent you the 
>>file (bradyn.mbx) that demonstrates the problem.
>
>Well, there is a call that says, "Please flush the buffer (ignore any 
>additional input)".
>
>Again, forcing an EOF is the right thing to do.  I tested that code 
>(forced in an EOF after state change, defer passing down the line that 
>causes the state change) in combination with some other stuff.  The bugs I 
>had were not related to that processing.

Again I say, please send me a tarball of your src directory.  I want to 
test the same code you have, not code that may differ because I've fixed 
something that is still present in yours.  Given the code you have, I can 
better tell what's happening.

>>I'm thinking it should wait until the end of the line.  However that's 
>>tricky as it's necessary to deal with folded lines.  When I got to that 
>>bit of complexity yesterday, I said "S**** it" and released my unified 
>>lexer.  (I didn't remove the 3 part lexer from cvs, so it can be easily 
>>resurrected by changing the makefile).
>
>I am not sure why it is a good idea to process the From line with the 
>plain text processor.  I thought about this, and think the right thing to 
>do is to pass in an EOF and copy the buffer.

Ideally, From isn't seen or recognized by any lexer except the head 
lexer.  However, I don't presently know enough to remove "From " 
recognition from the plain text and html text lexers.  You're way ahead of 
me in lexer knowledge.

Please _do_ implement the EOF so we can see if the code is better (cleaner) 
and faster.

Thanks.

David