Performance of plain text parser, one potential error in pattern.

David Relson relson at osagesoftware.com
Sat Feb 22 03:08:25 CET 2003


At 08:48 PM 2/21/03, Nick Simicich wrote:
>At 09:42 AM 2003-02-21 -0500, David Relson wrote:
>
>>Nick,
>>
>>I'll take a look at your lexer changes.
>>
>>The backslash/bracket changes should be easy to test.
>>
>>Deleting trailing periods from tokens sounds like the way it should 
>>be.  If we're not doing that, we should change.  If that changes the test 
>>results, that'll be fine.  I've come to recognize that lexer 
>>fixes/changes _always_ call for updating the reference results.
>
>We are, in fact, not including trailing punctuation, although we are 
>including interior punctuation.  I believe that is intentional.
>
>>I did the performance tests with Greg's large files.  I can include your 
>>changes and see what happens.  We should be able to get your command for 
>>hand-building lexer_text_plain.c into the Makefile.
>
>OK, however, that candidate lexer is not final, because it is doing the 
>wrong thing in a couple of cases.  The point is that if it is not going to 
>perform noticeably faster in its current slightly incorrect state, it is 
>not worth developing further.
>
>>Having mucked with the lexer code, I think you're beginning to appreciate 
>>Matthias' plan for a main lexer for handling the header and mime 
>>boundaries and having it call secondary lexers for normal text and for 
>>html, with the secondary lexers being totally oblivious to header fields, 
>>mime boundaries, and "^From " lines.  You're quickly becoming our lexer 
>>expert :-)
>
>I have no idea whether I am becoming the lexer expert or not.  I have 
>always wanted to work with Lex and this is such a chance.

FWIW, up through approximately 0.7.5, TOKEN was basically 
[letter][letter/digit]*.  At about that time the pattern changed to include 
some punctuation, but not all - which gave us the complex pattern we now 
know.  'Tis great that you're now getting to fulfill that lifelong 
ambition :-)
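
To make the trailing-punctuation point concrete, here's a minimal flex 
sketch of the shape of the thing - illustrative only, not bogofilter's 
actual rules, and the class names are made up: interior punctuation stays 
inside the token, while a trailing period falls through to the discard rule.

    /* token-sketch.l -- illustrative only, not bogofilter's real rules */
    %option noyywrap
    FRONT   [A-Za-z]
    MID     [A-Za-z0-9'.-]
    BACK    [A-Za-z0-9]
    %%
    {FRONT}({MID}*{BACK})?   { printf("token: %s\n", yytext); }
    .|\n                     { /* punctuation, whitespace, etc. discarded */ }
    %%
    int main(void) { yylex(); return 0; }

Fed "osagesoftware.com." it emits "osagesoftware.com" and drops the final 
dot.  Worth noting: a rule shaped like that backs up whenever it reads 
trailing punctuation and has to give it back, which is exactly the behavior 
your non-backing-up experiment is after.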


>I have another idea for working with oversize tokens which involves using 
>states, and never passing the tokens out - it should be possible to detect 
>them by using patterns, and by using states to indicate that you are 
>inside the giant tag.  Then you just pluck off bits of the giant tag, small 
>enough to handle easily (like 8k or so), and throw them away.  In fact, if 
>you have the right patterns, you should be able to avoid ever passing an 
>oversize token into the higher-level tokenizer.  You can throw 
>large patterns away in the matcher, hopefully in linear time.
>
>The point is that the patterns and the adjunct escape patterns will be 
>different if we want to avoid backing up.  More complex, harder to deal 
>with.  Should still be linear.  I do not want to do the work on that 
>unless I have some hope that having a non-backing-up pattern is going to 
>perform notably better.  So that is why I want the performance results.
>
>I really appreciate the way you are working with me on this.

The appreciation is mutual.  Also, thanks for the explanation.  I was gonna 
spend some time tweaking the TOKEN/T2/T1 lexer so that it would pass "make 
check" before running performance tests.  Now I realize that that is bass 
ackwards.  So I'll just run the timing tests.  By the way, I can send you 
a copy of the 3 test files if you want them.
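
For the record, here's how I picture your start-state trick, as a rough 
sketch.  The 30-character cutoff, the 1k chunk size (you suggested ~8k), 
and the state name are all made up for illustration:

    /* longtok-sketch.l -- rough sketch of the start-state idea */
    %option noyywrap
    %x LONGTOK
    %%
    [A-Za-z][A-Za-z0-9]{29}/[A-Za-z0-9]  { BEGIN(LONGTOK); /* oversize: discard */ }
    [A-Za-z][A-Za-z0-9]{0,29}            { printf("token: %s\n", yytext); }
    <LONGTOK>[A-Za-z0-9]{1,1024}         { /* eat the giant token in bounded chunks */ }
    <LONGTOK>[^A-Za-z0-9]                { yyless(0); BEGIN(INITIAL); /* it ended */ }
    .|\n                                 { /* ignore everything else */ }
    %%
    int main(void) { yylex(); return 0; }

The lookahead rule fires only when a 31st token character is coming, so 
legitimate tokens never enter LONGTOK, and the oversize run is consumed and 
thrown away in small pieces without ever being handed to the higher-level 
code - which, if I've understood you, is the whole point.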
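And on the backing-up question: the flex manual's "Performance 
Considerations" section shows the adjunct-escape-pattern trick you're 
describing, and "flex -b" writes a lex.backup report naming the rules that 
back up, so the before/after is easy to verify.  A toy example, nothing to 
do with our real patterns:

    /* backup-demo.l -- toy illustration of escape patterns.  Without the
     * f/fo/foob/fooba rules, 'flex -b backup-demo.l' reports backing up;
     * with them, every prefix is accepted and the scanner never backs up. */
    %option noyywrap
    %%
    foobar   { printf("keyword: %s\n", yytext); }
    foo      { printf("keyword: %s\n", yytext); }
    fooba    |
    foob     |
    fo       |
    f        { printf("ordinary text: %s\n", yytext); /* escape patterns */ }
    .|\n     { /* everything else */ }
    %%
    int main(void) { yylex(); return 0; }

Covering every prefix like that is exactly the "more complex, harder to 
deal with" part, which is why the timing numbers should come first.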