Performance of plain text parser, one potential error in pattern.
Nick Simicich
njs at scifi.squawk.com
Sat Feb 22 02:48:18 CET 2003
At 09:42 AM 2003-02-21 -0500, David Relson wrote:
>Nick,
>
>I'll take a look at your lexer changes.
>
>The backslash/bracket changes should be easy to test.
>
>Deleting trailing periods from tokens sounds like the way it should
>be. If we're not doing that, we should change. If that changes the test
>results, that'll be fine. I've come to recognize that lexer fixes/changes
>_always_ call for updating the reference results.
We are, in fact, not including trailing punctuation, although we are
including interior punctuation. I believe that is intentional.
>I did the performance tests with Greg's large files. I can include your
>changes and see what happens. We should be able to get your command for
>hand-building lexer_text_plain.c into the Makefile.
OK, however, that candidate lexer is not final, because it is doing the
wrong thing in a couple of cases. The point is that if it is not going to
perform noticeably faster in its current slightly incorrect state, it is
not worth developing further.
>Having mucked with the lexer code, I think you're beginning to appreciate
>Matthias' plan for a main lexer for handling the header and mime
>boundaries and having it call secondary lexers for normal text and for
>html, with the secondary lexers being totally oblivious to header fields,
>mime boundaries, and "^From " lines. You're quickly becoming our lexer
>expert :-)
I have no idea whether I am becoming the lexer expert or not. I have
always wanted to work with Lex and this is such a chance.
I have another idea for handling oversize tokens, which involves using
states and never passing the tokens out - it should be possible to detect
them by using patterns, and by using states to indicate that you are inside
the giant token. Then you just pluck off bits of the giant token, small
enough to handle easily (like 8k or so), and throw them away. In fact, if
you have the right patterns, you should be able to avoid ever passing an
oversize token into the higher-level tokenizer. You can throw large
patterns away in the matcher, hopefully in linear time.
The point is that the patterns and the adjunct escape patterns will be
different if we want to avoid backing up - more complex, harder to deal
with, but they should still be linear. I do not want to do the work on that
unless I have some hope that a non-backing-up pattern is going to perform
notably better. So that is why I want the performance results.
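As an illustration of the backup issue, consider a hypothetical pair of rules like the following (not the actual candidate patterns; SHORT and LONG are made-up token names):

```lex
%%
    /* backing-up rules: after reading "fooba" with no "r" following,
       the scanner must rescan back to the "foo" match */
foo     { return SHORT; }
foobar  { return LONG; }
```

Rewriting a rule set so that no state needs to rescan is what makes the patterns more complex. flex's `-b` option writes a `lex.backup` report listing the states that require backing up, which is a quick way to check whether a rewritten pattern set is actually backup-free.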
I really appreciate the way you are working with me on this.
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!