Performance of plain text parser, one potential error in pattern..

Nick Simicich njs at scifi.squawk.com
Fri Feb 21 14:11:22 CET 2003


I am trying to alter the plain text flex parser to make it faster.
Yesterday's messages seemed to indicate that it might be worth doing some
work to tune it.  The lex man page's explanation of how to make things
faster puts some emphasis on eliminating "backing up", so I tried reworking
the parser along those lines.  I managed to eliminate the backing up, and
all the basic tests are passing, but the advanced tests are not.  I built
one bogolexer that uses the original lexer_text_plain.l and another that
uses my lexer_text_plain.l, and I noted a couple of things:

The sequence:

TOKEN 
[^[:blank:][:cntrl:][:digit:][:punct:]][^][:blank:]<>;=():&%$#@!+|/\\{}^\"\?\*,[:cntrl:][]+[^[:blank:][:punct:][:cntrl:]]

is used in a number of places; I think all of the lexers, not just
lexer_text_plain.l, use it to define a basic token.  The point is to match
tokens of three or more characters, with one character class for the first
character, one for the middle characters, and one for the last character.
The intention, I think, is to disallow the characters [ and ] in the second
through (n-1)th characters of the token.  In my opinion this was not coded
correctly; it may happen to work, but if you make any alteration to the
pattern, it barfs.  I think that the intention was to code the following:

TOKEN 
[^[:blank:][:cntrl:][:digit:][:punct:]][^\][:blank:]<>;=():&%$#@!+|/\\{}^\"\?\*,[:cntrl:]\[]+[^[:blank:][:punct:][:cntrl:]]

Note that the [ and ] characters are backslashed.  That is the difference, 
and it allows you to alter the pattern more easily.
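
The difference is easier to see on a stripped-down class.  Purely as an
illustration (MIDDLE_OLD and MIDDLE_NEW are made-up names, not anything in
the real lexers), both of the following exclude blanks, control characters,
'[' and ']':

MIDDLE_OLD      [^][:blank:][:cntrl:][]
MIDDLE_NEW      [^\][:blank:][:cntrl:]\[]

The first form only works because a ']' that appears immediately after
"[^" is taken as a literal; put anything in front of it and the class is
silently cut short.  The second form spells the brackets out as \] and \[,
so characters can be added or removed anywhere in the class without
breaking it.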

Now, why is it important to change this?  As I mentioned, I am trying to
let flex generate a more efficient parser by eliminating backing up.  You
do this by adding rules that let it dispose of a partially matched token at
any point.  For example, the IPADDR rule changes to

{IPADDR}/[^0-9.]                                { return(IPADDR);}
[0-9.]+                                         ;

This says, specifically: if you have something that exactly matches the
current IP address pattern followed by something that is not a digit or a
dot, you really do have an IP address (this is more precise than the
current definition).  If instead you have some other random mix of digits
and dots, that is not something we care about: we currently do not save or
process numeric tokens, they are thrown away by the catch-all ". ;" rule
that currently comes last (unless, of course, they exactly match the
IPADDR pattern).  The second pattern covers every proper prefix of the
first, so flex always has an "escape" for a partial match.

If all of your patterns have escapes, then flex can use a different 
technique that allows it to build a more efficient parser, or such is the 
claim.  That is what I am hoping someone will measure.
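
The same recipe applies to any pattern that can match a long prefix before
failing: follow it with a cheaper rule that is happy to swallow that
prefix.  Purely as an illustration of the shape (MONEY is a made-up token
here, not one of bogofilter's), the pair would be a definition like

MONEY           \$[0-9]+\.[0-9][0-9]

and, in the rules section,

{MONEY}/[^0-9.]                                 { return(MONEY); }
\$[0-9.]*                                       ;

Without the second rule, input such as "$12.3x" would force the scanner to
back up to the last point where it had something it could accept; with it,
every partial match has somewhere to go.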

Anyway, in order to keep from having to backtrack in and around the token
processing, you make the following changes:

TOKEN 
[^[:blank:][:cntrl:][:digit:][:punct:]][^\][:blank:]<>;=():&%$#@!+|/\\{}^\"\?\*,[:cntrl:]\[]+[^[:blank:][:punct:][:cntrl:]]
T2 
[^[:blank:][:cntrl:][:digit:][:punct:]][^\][:blank:]<>;=():&%$#@!+|/\\{}^\"\?\*,[:cntrl:]\[]+
T1      [^[:blank:][:cntrl:][:digit:][:punct:]]

Later, you code:

{TOKEN}                                         { return(TOKEN); }
{T2}                                            ...new rule....
{T1}                                            ;

(The {T1} rule is not strictly necessary in our case, because there is
already a "." rule that disposes of single characters of unneeded text.)
The point is to always give the parser something to do with a "thing" it
has started parsing, without making it "back up".

Now, after some hours of work (most of the trial and error went into
building the right patterns to work with the mime boundaries), I managed to
put together a ruleset for plain text that does not back up.  However, it
did not correctly parse the test cases: the rules threw away all tokens
that ended in a ".", that is, all words at the end of sentences.  I started
writing a simple rule to handle that, then realized it would do the wrong
thing for a word followed by an ellipsis.... and a space; without the
space, the word/punctuation pair would be parsed as a single token.  So I
wrote an action for the rule that peels off the trailing characters and
pushes them back onto the input while retaining the valid characters.  I
believe that is working.
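
I will not swear this is character-for-character what my action ended up
as, but the general shape of it, using flex's yyless() to hand the trailing
punctuation back to the input (and assuming <ctype.h> is pulled in up in
the definitions section), is roughly:

{T2}            {
                    int n = yyleng;
                    /* peel trailing punctuation off the match */
                    while (n > 0 && ispunct((unsigned char)yytext[n - 1]))
                        n--;
                    if (n > 0) {
                        yyless(n);      /* keep the word, push the rest back */
                        return(TOKEN);
                    }
                    /* nothing left but punctuation: let the catch-all eat it */
                }

so that "sentences." comes back as the token "sentences", with the "." left
on the input for the next rule to deal with.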

In any case, I coded an action around T2 that passes the basic tests.  It
does not pass the advanced tests; I believe it is not properly switching
into and out of header mode.  (Boy, this is a major pain in the butt.  I am
starting to sympathize with the view that separating a mailbox into
separate files should be done in a different piece of code - there is a lot
of complexity involved in reinitializing... and, frankly, I somehow
disabled that code and simply can't figure out how I did it.)

I do not believe that there is a problem in my code -- I think that there 
is some other problem that my code exposes.

As an example:  There is a test file called "good.mbx".

Altering the original lexer_text_plain.l so that the action for the line
that matches the "^From[ \t]" pattern also does

   fprintf(stderr, "Found from %d/%s,%d,%d\n", yyleng, yytext,
           msg_header, is_from(&yy_text));

gives the line

Found from 5/From ,1,1

which repeats 30 times.  Somehow the msg_header variable is being reset.

If I change to my version of the lexer, it produces the following line:

Found from 5/From ,0,1

which repeats 47 times, even though the file itself contains 48 "^From " lines:

[njs at glock inputs]$ grep '^From ' good.mbx | wc -l
      48
[njs at glock inputs]$

This simply does not make much sense to me.  My output tracks the original
code's output until I get to

--- /tmp/good.out.orig  Fri Feb 21 05:28:21 2003
+++ /tmp/good.out.alter Fri Feb 21 06:07:44 2003
@@ -410,7 +410,11 @@
  get_token: 1 'wall`
  get_token: 1 'maa00756`
  get_token: 1 'wall.org`
-get_token: 2 'from`
+get_token: 1 'from`
+get_token: 1 'glamb`
+get_token: 1 'dynagen.on.ca`
+get_token: 1 'fri`
+get_token: 1 'nov`
  get_token: 1 'return-path`
  get_token: 1 'glamb`
  get_token: 1 'dynagen.on.ca`

and this divergence is at the "^From " line of the second message included in the mbox file.

So, as far as I can tell, I am parsing right along with the original until
we start the *second* message, and somehow I do not manage to get back into
header mode correctly.

Now, there is a variable called msg_header that is tested in the action for
this line.  It seems to be the general "I am in a header" flag.

It is set from mime_reset.

mime_reset is called from the routines that are invoked when a closing mime
boundary is found.  It is also called from token.c when got_from() is
called.  But it looks like msg_header is being checked *before* mime_reset
is called, which simply does not make any sense.

But I can't figure out how it is getting called -- I know that when I call
it right out of the action, my parsing tracks the original's for the first
several thousand lines.  Then it diverges around a line with some leading
dashes that could be a mime separator but is not - my code retains the
leading dashes on the token.  I can probably fix that.

I have tried dealing with the issues.  At this point, I would be interested
to know whether this makes any difference in the oversize e-mail
processing.  If someone wants to test it:

1.  Pick it up from here:

http://majordomo.squawk.com/njs/demime/lexer_text_plain.l

2. Do a make.  Then hand-build the lexer with these options.

  flex -cF -v -p -b -Ptext_plain_ -olexer_text_plain.c lexer_text_plain.l

3. Then do another make.

Things should be all right if you are processing single mail files, but I
would not use this for production work.  I would like the people who did
the performance runs with the large files to repeat them with this version
of lexer_text_plain.l, flexed with the above options, to see whether it
really runs faster than the old version.  The most important option is
-cF; -v, -p and -b are basically "reporting" options about the build.  The
-cF option tells flex not to compress the tables, which is what allows it
to build a parser that takes advantage of the fact that the ruleset does
not have any backing up.

By the way, I see no way to tune the html parser in this manner because of
the lookahead, although that may be my incomplete understanding of the
issues.  Another approach might be to add states to this processor so that,
once it is inside a very large token, it simply skips the rest of it in
manageable pieces.  I believe that there is an underlying maximum size for
the tokens anyway, and it might be worthwhile to add some states and a way
to rapidly step through huge tokens without saving them.  But let's see if
this gives us any performance improvement first; there needs to be a base
with a TOKEN definition that can be altered before any other mods are done.
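
For what it is worth, the rough shape I have in mind for that is an
exclusive start condition that chews through an oversized run in bounded
bites and throws it away.  This is only a sketch - the WORDCHAR definition,
the 200-character cutoff, and the interplay with the real TOKEN rules are
all assumptions, and I have not tried it against bogofilter:

WORDCHAR        [^[:blank:][:cntrl:]]

%x LONGTOK

%%
{WORDCHAR}{201}                 { BEGIN(LONGTOK); /* longer than any token we keep */ }
<LONGTOK>{WORDCHAR}{1,201}      ;
<LONGTOK>[[:blank:][:cntrl:]]   { yyless(0); BEGIN(INITIAL); /* rescan the delimiter normally */ }

For the first rule to win, the ordinary TOKEN patterns would also need a
{1,200}-style bound in place of their "+", so that longest-match does not
simply hand the whole run to them anyway.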

--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on  confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


