lexer differences [was: The joy of buffer switching....]
David Relson
relson at osagesoftware.com
Thu Feb 27 19:45:52 CET 2003
Hi Nick,
At 09:33 AM 2/27/03, Nick Simicich wrote:
>At 11:46 AM 2003-02-26 -0500, David Relson wrote:
>
>>>Actually, I looked at that and I do not see it being called at
>>>all. Maybe that is part of the issue as to why I am in b,b state and I
>>>do not expect them to be.
>>
>>Sounds like you've got something broken...
>
>The fact that your output matches my output means that your bugs probably
>match my bugs. Do t.lexer.mbx with -v and look at the state transitions.
The output of lexer_v3 doesn't match the output of the head/text/html lexer
trio. It's close but slightly different. Having examined the differences,
I think that they're caused by lexer_v3 doing a couple of things right that
the trio didn't do properly. I deem the new output to be "good".

Your output matches lexer_v3, except for the two differences in handling of
non-breaking html tags. That's a good thing. I'm just looking for input
on whether the differences are a good thing or a bad thing. We'll change
the code according to the good/bad judgements.

Are there bugs in the lexer? Very possibly there are, but I haven't yet
found them and nobody else has either.

FWIW, I'm referring to tokens output by the lexer in the above
comments. The lexer trio wasn't quite right with the head/body flags
(states), though I think the unified lexer is correct. Certainly I see
is_from() and got_from() being called, and transitions from body to head
mode when From is encountered. Your statement that got_from() is never
called is very puzzling.
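
For anyone following the head/body discussion, the transition described
above can be sketched roughly as follows. This is an illustration only, not
bogofilter's code: lexer_mode, is_from_line() and mode_for_line() are
hypothetical names, and the real is_from() and got_from() may work quite
differently. It assumes a separator is any line starting with "From " and
that a blank line ends the header.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical two-state model of the head/body flag discussed above. */
typedef enum { MODE_HEAD, MODE_BODY } lexer_mode;

/* Assumption: an mbox separator is any line starting with "From ". */
static int is_from_line(const char *line)
{
    return strncmp(line, "From ", 5) == 0;
}

/*
 * Return the mode in which `line` itself should be tokenized, and update
 * *mode to the mode that applies to the lines that follow it.
 */
static lexer_mode mode_for_line(lexer_mode *mode, const char *line)
{
    assert(mode != NULL && line != NULL);

    if (is_from_line(line)) {
        /* A separator always starts a new header, even from body mode. */
        *mode = MODE_HEAD;
        return MODE_HEAD;
    }
    if (*mode == MODE_HEAD && (line[0] == '\n' || line[0] == '\0')) {
        /* The blank line closes the header; later lines are body. */
        *mode = MODE_BODY;
        return MODE_HEAD;
    }
    return *mode;
}
```

The point of deciding the mode before the line is tokenized is that the
From line itself is then handled in head mode, rather than being noticed
only after a body lexer has already started on it.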
... [snip] ...
>>Ideally, neither the plain text nor the html parser should know about
>>From. However when they're in control, there needs to be a way to say
>>"return to header mode", which is a task that the low-level check for
>>From helps with. However, by the time the low-level code sees the From,
>>it's too late to say "I should be using the header lexer on this
>>line". Of course with a single lexer, this problem goes away, because no
>>buffer/lexer switching is needed.
>
>The point is that the right thing to do, in my opinion, is to just return
>a zero length and let EOF processing happen. When EOF is done, you can
>just, I believe, call the lexer again. It resets the buffer to ask for
>more input -- it only forces things through EOF once.
As I've said before, we're smarter than the lexer and if we need to tell it
there's an EOF, we should do that.
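
A minimal sketch of the forced-EOF idea, in plain C and independent of flex:
the input routine returns a zero length once when it reaches a From line,
and holds that line back so the next call delivers it. The names msg_reader,
reader_getline() and reader_at_end() are made up for this example, and the
array of lines stands in for the real input file; none of this is taken from
the bogofilter source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical line source; `lines` is a NULL-terminated array. */
typedef struct {
    const char **lines;  /* input lines, each ending in '\n' */
    size_t next;         /* index of the next line to hand out */
    int started;         /* nonzero once the current message has begun */
} msg_reader;

/* True when the underlying input is really exhausted. */
static int reader_at_end(const msg_reader *r)
{
    return r->lines[r->next] == NULL;
}

/*
 * Copy the next line into buf and return its length.
 * Returns 0 at the real end of input, and also once at each message
 * boundary; in the boundary case the "From " line is deferred, not lost.
 * Overlong lines are simply truncated in this sketch.
 */
static size_t reader_getline(msg_reader *r, char *buf, size_t size)
{
    const char *line = r->lines[r->next];
    size_t len;

    assert(size > 0);
    if (line == NULL)
        return 0;                     /* real end of input */

    if (r->started && strncmp(line, "From ", 5) == 0) {
        r->started = 0;               /* report EOF once ... */
        return 0;                     /* ... and keep the line for later */
    }

    r->started = 1;
    len = strlen(line);
    if (len >= size)
        len = size - 1;
    memcpy(buf, line, len);
    buf[len] = '\0';
    r->next++;
    return len;
}
```

A caller that gets 0 back can check reader_at_end() to tell a message
boundary from the true end of input. Whether a flex scanner can simply be
called again after seeing such an EOF, or needs to be restarted first, is
something to verify against the generated scanner rather than assume from
this sketch.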
>>Processing of the From line needs to be completed before processing
>>begins on the next line. In bogofilter 0.10.x, the lexer switch leaves
>>the old lexer with a partially processed line. Odds are that bogofilter
>>will again use that old lexer - at which time processing of the partial
>>line will complete. This can result in a token belonging to one message
>>being processed as part of a later message. I believe I sent you the
>>file (bradyn.mbx) that demonstrates the problem.
>
>Well, there is a call that says, "Please flush the buffer (ignore any
>additional input)".
>
>Again, forcing an EOF is the right thing to do. I tested that code
>(forced in an EOF after state change, defer passing down the line that
>causes the state change) in combination with some other stuff. The bugs I
>had were not related to that processing.
Again I say, please send me a tarball of your src directory. I want to
test the same code you have, not code that may differ because I've fixed
something that is still present in yours. Given the code you have, I can
better tell what's happening.
>>I'm thinking it should wait until the end of the line. However that's
>>tricky as it's necessary to deal with folded lines. When I got to that
>>bit of complexity yesterday, I said "S**** it" and released my unified
>>lexer. (I didn't remove the 3 part lexer from cvs, so it can be easily
>>resurrected by changing the makefile).
>
>I am not sure why it is a good idea to process the From line with the
>plain text processor. I thought about this, and think the right thing to
>do is to pass in an EOF and copy the buffer.
Ideally, From isn't seen or recognized by any lexer except the head
lexer. However, I don't presently know enough to remove "From "
recognition from the plain text and html text lexers. You're way ahead of
me in lexer knowledge.
Please _do_ implement the EOF so we can see if the code is better (cleaner)
and faster.
Thanks.
David