Performance of plain text parser, one potential error in pattern.
David Relson
relson at osagesoftware.com
Sat Feb 22 03:08:25 CET 2003
At 08:48 PM 2/21/03, Nick Simicich wrote:
>At 09:42 AM 2003-02-21 -0500, David Relson wrote:
>
>>Nick,
>>
>>I'll take a look at your lexer changes.
>>
>>The backslash/bracket changes should be easy to test.
>>
>>Deleting trailing periods from tokens sounds like the way it should
>>be. If we're not doing that, we should change it. If that changes the test
>>results, that'll be fine. I've come to recognize that lexer
>>fixes/changes _always_ call for updating the reference results.
>
>We are, in fact, not including trailing punctuation, although we are
>including interior punctuation. I believe that is intentional.
>
>>I did the performance tests with Greg's large files. I can include your
>>changes and see what happens. We should be able to get your command for
>>hand-building lexer_text_plain.c into the Makefile.
>
>OK, however, that candidate lexer is not final, because it is doing the
>wrong thing in a couple of cases. The point is that if it is not going to
>perform noticeably faster in its current slightly incorrect state, it is
>not worth developing further.
>
>>Having mucked with the lexer code, I think you're beginning to appreciate
>>Matthias' plan for a main lexer for handling the header and mime
>>boundaries and having it call secondary lexers for normal text and for
>>html, with the secondary lexers being totally oblivious to header fields,
>>mime boundaries, and "^From " lines. You're quickly becoming our lexer
>>expert :-)
>
>I have no idea whether I am becoming the lexer expert or not. I have
>always wanted to work with Lex and this is such a chance.
FWIW, up through approx 0.7.5, TOKEN was basically
[letter][letter/digit]*. At about that time the pattern changed to include
some punctuation, but not all - which gave us the complex pattern we now
know. 'Tis great that you're now getting to fulfill that lifelong
ambition :-)
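For illustration, the old rule is easy to state outside of lex; a rough sketch of that pre-0.7.5 [letter][letter/digit]* scan in plain C (illustrative only, not the real lexer) might be:

```c
#include <ctype.h>
#include <stddef.h>

/* Sketch of the old TOKEN rule: a token starts with a letter and
 * continues over letters and digits.  Returns the token's length,
 * or 0 if the input does not begin a token. */
static size_t old_token_len(const char *s)
{
    size_t n;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    for (n = 1; isalnum((unsigned char)s[n]); n++)
        continue;
    return n;
}
```

Under this rule "foo.bar" scans as just "foo"; the punctuation added later is what complicated the pattern.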
>I have another idea for working with oversize tokens which involves using
>states, and never passing the tokens out - it should be possible to detect
>them by using patterns, and by using states to indicate that you are
>inside the giant tag. Then you just pluck off bits of the giant tag, small
>enough to handle easily (like 8k or so) and throw them away. In fact, if
>you have the right patterns, you should be able to avoid ever passing an
>oversize token into the higher level tokenizer program. You can throw
>large patterns away in the matcher, hopefully in linear time.
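The chunking idea can be sketched outside of flex too; a minimal sketch in plain C, assuming a fixed 8k chunk and a hypothetical helper name (nothing here is bogofilter code):

```c
#include <ctype.h>
#include <stdio.h>

#define CHUNK 8192  /* pluck off bounded pieces, per the idea above */

/* Sketch: consume and discard the rest of an oversize token from fp
 * without ever buffering more than CHUNK bytes at once.  Reading
 * stops at the first whitespace (the end of the token) or EOF; the
 * delimiter is pushed back for the caller.  Returns bytes discarded. */
static long discard_oversize(FILE *fp)
{
    char buf[CHUNK];
    size_t n = 0;
    long total = 0;
    int c;

    while ((c = getc(fp)) != EOF && !isspace((unsigned char)c)) {
        buf[n++] = (char)c;   /* accumulate into the bounded buffer */
        if (n == sizeof buf)
            n = 0;            /* chunk is full: throw it away */
        total++;
    }
    if (c != EOF)
        ungetc(c, fp);        /* leave the delimiter for the lexer */
    return total;
}
```

The point of the sketch is the memory bound: no matter how long the run, only one CHUNK-sized buffer is ever live, which is the same effect as a flex start condition that matches and discards bounded pieces of the giant token.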
>
>The point is that the patterns and the adjunct escape patterns will be
>different if we want to avoid backing up. More complex, harder to deal
>with. Should still be linear. I do not want to do the work on that
>unless I have some hope that having a non-backing up pattern is going to
>perform notably better. So that is why I want the performance results.
>
>I really appreciate the way you are working with me on this.
The appreciation is mutual. Also, thanks for the explanation. I was gonna
spend some time to tweak the TOKEN/T2/T1 lexer so that it would pass "make
check" before running performance tests. Now I realize that that is bass
ackwards. So, I'll just run the timing tests. By the way, I can send you
a copy of the 3 test files if you want them.