Performance of plain text parser, one potential error in pattern.

Nick Simicich njs at scifi.squawk.com
Sat Feb 22 02:48:18 CET 2003


At 09:42 AM 2003-02-21 -0500, David Relson wrote:

>Nick,
>
>I'll take a look at your lexer changes.
>
>The backslash/bracket changes should be easy to test.
>
>Deleting trailing periods from tokens sounds like the way it should 
>be.  If we're not doing that, we should change.  If that changes the test 
>results, that'll be fine.  I've come to recognize that lexer fixes/changes 
>_always_ call for updating the reference results.

We are, in fact, not including trailing punctuation, although we are 
including interior punctuation.  I believe that is intentional.
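For illustration, a pattern in that spirit (hypothetical — not the actual bogofilter lexer rule) can only extend across a dot when another alphanumeric run follows it, so interior punctuation is kept while a trailing period falls outside the match:

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Length of the token matched at the start of s, using a hypothetical
 * pattern: an alphanumeric run, optionally extended by punctuation that
 * is itself followed by another alphanumeric run.  A trailing '.' has
 * no following run, so it cannot be part of the match. */
static size_t token_len(const char *s)
{
    regex_t re;
    regmatch_t m;
    size_t len = 0;
    int rc = regcomp(&re, "^[[:alnum:]]+([._-][[:alnum:]]+)*", REG_EXTENDED);
    assert(rc == 0);
    if (regexec(&re, s, 1, &m, 0) == 0)
        len = (size_t)(m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

So "version1.2.3" is matched whole, while "sentence." yields only "sentence", with the period left behind.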

>I did the performance tests with Greg's large files.  I can include your 
>changes and see what happens.  We should be able to get your command for 
>hand-building lexer_text_plain.c into the Makefile.

OK, however, that candidate lexer is not final, because it is doing the 
wrong thing in a couple of cases.  The point is that if it is not going to 
perform noticeably faster in its current slightly incorrect state, it is 
not worth developing further.

>Having mucked with the lexer code, I think you're beginning to appreciate 
>Matthias' plan for a main lexer for handling the header and mime 
>boundaries and having it call secondary lexers for normal text and for 
>html, with the secondary lexers being totally oblivious to header fields, 
>mime boundaries, and "^From " lines.  You're quickly becoming our lexer 
>expert :-)

I have no idea whether I am becoming the lexer expert or not.  I have 
always wanted to work with Lex and this is such a chance.

I have another idea for handling oversize tokens which involves using 
states, and never passing the tokens out - it should be possible to detect 
them by using patterns, and by using states to indicate that you are inside 
the giant tag.  Then you just pluck off bits of the giant tag, small enough 
to handle easily (like 8k or so), and throw them away.  In fact, if you have 
the right patterns, you should be able to avoid ever passing an oversize 
token into the higher-level tokenizer.  You can throw large 
patterns away in the matcher, hopefully in linear time.
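The chunked-discard idea can be sketched outside of flex.  This is a hypothetical C illustration of the mechanism (the names, the 30-byte cap, and the tiny chunk size are made up for the example — the mail suggests roughly 8k chunks), not bogofilter code:

```c
#include <stddef.h>
#include <string.h>

#define MAXTOKEN 30  /* hypothetical cap on token length */
#define CHUNK     8  /* discard oversize runs this many bytes at a time,
                        standing in for the 8k chunks suggested above */

/* Count only the tokens short enough to pass upward.  An oversize run
 * is consumed CHUNK bytes at a time and thrown away, so the giant
 * token is never buffered whole or handed to the higher-level
 * tokenizer - the analogue of staying in a "discard" start state. */
static int count_tokens(const char *text)
{
    int count = 0;
    const char *p = text;
    while (*p != '\0') {
        if (*p == ' ') { p++; continue; }
        size_t len = strcspn(p, " ");   /* length of the current run */
        if (len <= MAXTOKEN) {
            count++;                    /* normal token: would be emitted */
            p += len;
        } else {
            while (len > 0) {           /* "inside the giant tag" */
                size_t step = len < CHUNK ? len : CHUNK;
                p += step;              /* pluck off a bit, throw it away */
                len -= step;
            }
        }
    }
    return count;
}
```

Each chunk is handled in constant space, so the whole scan stays linear in the input, which is the property the non-backing-up patterns are meant to preserve.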

The point is that the patterns and the adjunct escape patterns will be 
different if we want to avoid backing up.  More complex, harder to deal 
with.  They should still be linear.  I do not want to do that work unless 
I have some hope that a non-backing-up pattern is going to perform 
notably better.  So that is why I want the performance results.

I really appreciate the way you are working with me on this.

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!



More information about the Bogofilter mailing list