Performance issues... and ugly news.

Nick Simicich njs at scifi.squawk.com
Sat Feb 22 13:56:12 CET 2003


Summary:  In order to do *any* optimization of the generated scanners, you 
have to generate batch scanners.  The scheme of multiple scanners has a 
hidden dependency on interactive scanners.  I will explain.  There are 
multiple ugly solutions.

-=-=-=-

David Relson sent me the files that he used to run his testing programs.  I 
built three binaries: my new parser with my new options, the original 
parser with the original options, and the original parser with only the new 
options.

I made three runs.  One is reproduced below. These numbers are typical.  To 
regularize the environment, I built a new goodlist.db and spamlist.db 
before each run with one word in each database.  You can see the procedures 
I used below.

Run 1 was done with my new parser.  Run 2 was done with the original 
parser using the original options.  Run 3 was done with the original 
parser, but with the new flex options.

user    0m34.486s - user time from my parser (no backing up) and the new 
options.
user    1m13.369s - original user time on my platform (1 GHz P III), as a 
baseline.
user    0m40.061s - user time, old parser, new options (specifying -CF).

The answer is that there are large savings available when the flex parser 
is built with -CF.  So, I said, well, what is the next step?  Obviously, 
building 0.10.3, the current stable version, with -CF.  I was looking 
forward to saving a bunch of CPU time on my personal machine with just 
this one quick parameter change!  I changed the following lines in the 
Makefile (crudely; there is obviously a better way to do it, especially 
since the LEX variable seems to be unused):

lexer_head.c:   lexer_head.l
         flex -CF -Plexer_ -o$@ $<

lexer_text_plain.c:     lexer_text_plain.l
         flex -CF -Ptext_plain_ -o$@ $<

lexer_text_html.c:      lexer_text_html.l
         flex -CF -Ptext_html_ -o$@ $<
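
For what it is worth, a cleaner version of the same change would route the 
flags through variables.  The LFLAGS name below is my own invention, so 
check that bogofilter's Makefile does not already use it for something else:

LEX     = flex
LFLAGS  = -CF

lexer_head.c:   lexer_head.l
         $(LEX) $(LFLAGS) -Plexer_ -o$@ $<

lexer_text_plain.c:     lexer_text_plain.l
         $(LEX) $(LFLAGS) -Ptext_plain_ -o$@ $<

lexer_text_html.c:      lexer_text_html.l
         $(LEX) $(LFLAGS) -Ptext_html_ -o$@ $<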

Then I did a:

touch *.l
make && make check

make check does not pass: t.lexer fails.  The problem is as follows:

[root at scifi tests]# cat inputs/t.lexer.inp
 From nobody at nowhere.com
Subject: Stephen King's latest thriller! Free Shipping & Handling

Red Rabbit By Tom Clancy
Tom Clancy's back with the story behind his most popular
character - Jack Ryan.

Leadership By Rudolph Giuliani
The controversial mayor who guided New York City through the
biggest crisis in the city's history reflects on his time in
office and the lessons he learned there.

Subject: Rent all the DVDs you want -- Free Trial!

The Standard Program is a month-to-month subscription cancelable at
anytime. Click the "Your Account" button for cancellation
instructions. No refunds or credits for partial monthly subscription
periods. Please visit www.netflix.com for complete terms and
conditions, including length of free trial period. Netflix reserves
the right to change terms and conditions at any time. (c)2002 JunkFlix,
Inc.
[root at scifi tests]# diff -u ./outputs/t.lexer.out ./lexer.20030222/t.lexer.out
--- ./outputs/t.lexer.out       Thu Jan 30 13:07:22 2003
+++ ./lexer.20030222/t.lexer.out        Sat Feb 22 05:13:02 2003
@@ -13,7 +13,6 @@
  character
  city
  city's
-clancy
  clancy's
  click
  complete
@@ -56,8 +55,6 @@
  please
  popular
  program
-rabbit
-red
  reflects
  refunds
  rent

All of the missing tokens ("red", "rabbit", "clancy") come from one line: 
the first line of the body, immediately following the header.  This 
failure was from my Red Hat 5.0 system.  I verified that it fails in 
exactly the same manner on my Red Hat 7.2 system.

I have tried tracing the code, and I do not see where the line is lost.  I 
see the scanner find the newline and do the newline processing, and the 
next thing it does is work on the second line of the body.

I wonder if this is part of the issue that affects my code.

Dave, I want to toss this over the wall to you.  The question is, "Can you 
tell what the problem is when the lexers are built with -CF?"

My question was, "Which part of the -CF alterations to the lexer is 
causing us problems?"

I tried building the lexer without -CF, and of course it worked.

I tried building it with -B, and it failed.  The only change -B makes is 
the lookahead: the lexer is less careful about how much text it has 
absorbed, because -B means "batch".  The normal lexer is built with -I, 
which means "interactive".  An interactive lexer reads as little text from 
the input stream as possible.  (This is not guaranteed in every case, but 
it is the usual behavior.)

So, what has happened?  The man page says, regarding -I:

 >`-I'
 >instructs flex to generate an interactive scanner. An interactive scanner is
 >one that only looks ahead to decide what token has been matched if it
 >absolutely must. It turns out that always looking one extra character ahead,
 >even if the scanner has already seen enough text to disambiguate the current
 >token, is a bit faster than only looking ahead when necessary. But scanners
 >that always look ahead give dreadful interactive performance; for example,
 >when a user types a newline, it is not recognized as a newline token until
 >they enter another token, which often means typing in another whole line.
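
You can see this with a toy scanner.  This is just a sketch to demonstrate 
the lookahead, not anything from bogofilter; call it toy.l:

%%
\n      { printf("got newline\n"); }
.       ;
%%

Build it both ways and type lines at each from a terminal:

flex -I -o toy_i.c toy.l && cc -o toy_i toy_i.c -lfl
flex -B -o toy_b.c toy.l && cc -o toy_b toy_b.c -lfl

The -I version reports each newline as soon as you press enter; the -B 
version sits on it until the first character of the next line arrives.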

That is what is happening here.  There are three scanners, and the first 
(header) scanner has "absorbed text": its lookahead has pulled the start 
of the body into its private buffer.  In this test case that is the "Red 
Rabbit By Tom Clancy" line.  The text will not be available again until 
you are back in that scanner, looking at a header somewhere.

This is worse than it sounds.  Any optimization of any of the scanners 
requires, as a first step, that the scanner be non-interactive.  No 
speedup of any sort can be done until this is fixed.

I can't find anything in the documentation that lets you ask a scanner, 
"Tell me what is in your buffers," and maybe give the unread text back.

Can you? (anyone?)
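
The information does exist inside the generated scanner.  Code in the 
user-code section of a .l file ends up in the same C file as the scanner 
internals, so it can reach them directly; here is a sketch for flex 2.5.x 
(non-reentrant).  These are undocumented internals, and the names could 
change between flex versions:

/* In the user-code section of the .l file.  yy_current_buffer and
   yy_c_buf_p are flex 2.5.x internals: the active buffer, and the
   pointer to the next character the scanner will look at. */
static void show_unread(void)
{
    char *end = yy_current_buffer->yy_ch_buf + yy_current_buffer->yy_n_chars;
    printf("unread: %.*s\n", (int)(end - yy_c_buf_p), yy_c_buf_p);
}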

This is probably at least part of the issue with my random errors.

I can see three possibilities.  One is to hack the scanner to somehow move 
the partially eaten buffers from one queue to another.

The second is to extract the data from the partially eaten buffers when 
switching scanners and to put it back into the input stream.

The third is to do all of the splitting of the input from outside the 
scanner.  You know when you have reached an empty line, so you do not keep 
feeding stuff to the header scanner; you feed it an end-of-file and force 
it to jam.

Under that scheme you have to do all scanning for MIME separators in the 
input routine.  Again, when you reach the end of a section, you feed the 
scanner an EOF and jam it.  And you have to scan for "From " there as well.

This is not as expensive as you might think.  I believe that the outer 
routine is already breaking things into lines.  If you are processing 
multiple messages, you have to check the first character for 'F'.  If you 
are running a MIME message with separators, you need to check for '-'.  
And if you are in a header, you need to check for '\n'.  A sketch of that 
dispatcher follows.
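
Here is a minimal sketch of that outer loop.  Every name in it is made up 
for illustration, and the boundary handling is far cruder than real MIME 
parsing:

#include <stdio.h>
#include <string.h>

enum section { HEADER, BODY };

/* hypothetical hooks into the scanners */
void scanner_feed_line(const char *line, enum section sec);
void scanner_feed_eof(void);            /* jam the current scanner */

void dispatch(FILE *in, const char *boundary)
{
    char line[4096];
    enum section sec = HEADER;

    while (fgets(line, sizeof line, in) != NULL) {
        if (line[0] == 'F' && strncmp(line, "From ", 5) == 0) {
            scanner_feed_eof();         /* new message: back to headers */
            sec = HEADER;
        } else if (sec == HEADER && line[0] == '\n') {
            scanner_feed_eof();         /* blank line: headers are done */
            sec = BODY;
            continue;                   /* the scanners never see it */
        } else if (boundary != NULL && line[0] == '-' &&
                   strncmp(line, "--", 2) == 0 &&
                   strncmp(line + 2, boundary, strlen(boundary)) == 0) {
            scanner_feed_eof();         /* end of MIME part: jam it */
            sec = HEADER;               /* each part has its own header */
            continue;
        }
        scanner_feed_line(line, sec);
    }
    scanner_feed_eof();
}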

If you want to maintain multiple scanners, you need to do the above.  The 
alternative, which would not be that bad, would be to use start conditions 
and combine all the scanners into one.  The problem there is that the 
simpler scanners (the plain text scanner definitely, and maybe the header 
scanner, which matters less) could otherwise be recoded as no-backing-up 
scanners; I am fairly sure that the HTML scanner can't be coded that way.
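
For the record, a skeleton of the combined-scanner idea, using exclusive 
start conditions.  The patterns here are placeholders, not bogofilter's 
real rules:

%x BODY HTML
%%
^[A-Za-z-]+:        { /* header field name (INITIAL acts as the
                         header state) */ }
^\n                 { BEGIN(BODY);    /* blank line ends the headers */ }
<BODY>^"From "      { BEGIN(INITIAL); /* next message in the mbox */ }
<BODY>"<html"       { BEGIN(HTML);    /* crude switch into HTML rules */ }
<HTML>"</html>"     { BEGIN(BODY); }
<INITIAL,BODY,HTML>[A-Za-z']+  { /* hand the token to the counter */ }
<*>.|\n             ;
%%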

We can see the performance gains.  The worst case is the 73 CPU seconds to 
parse the miserably large message.  The optimization to a batch scanner 
with expanded tables gets us to 40 seconds.

That built-in optimization brings us to about 55% of the original time 
(a saving of about 45%).

Eliminating backing up gets us to 35 seconds.  The backing up is another 
14% or so (35 seconds is about 47% of the original 73).

So, we can probably make the scanner run about twice as fast, just by 
combining all of the lexers into one and moving the state changes into the 
lexer, where they would be represented by lex start conditions so that the 
right patterns are selected.

But if we keep the lexers separate, we can make the text lexer maybe 7% 
faster relative to the total (14% relative to its own time) by eliminating 
backups... though we will lose some of that, because boundary detection, 
"From " line detection, and header-end detection will have to move into 
the higher-level routine that does the line splitting.

Simple?

Of course, if it is possible to move any partial buffers from one lexer to 
the other, then all of the problems go away.  We can use the current 
structures...

I thought, "Why don't we have the different scanners use the same anchor 
for the buffers?"

My head hurts and I have to stop thinking about this for a while.

YY_CURRENT_BUFFER holds the current buffer.  It is of type YY_BUFFER_STATE.

When you get the newline, maybe you stash YY_CURRENT_BUFFER somewhere, 
then hand it to the other scanner with that scanner's yy_switch_to_buffer, 
and so forth.

At first I thought that the buffer would be local to the scanner.  I do not 
think it is.
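
A sketch of the handoff, assuming the -P prefixes from the Makefile above.  
The prefix replaces the "yy", so yy_switch_to_buffer becomes 
lexer__switch_to_buffer, text_plain__switch_to_buffer, and so on.  The 
get-current-buffer accessor is hypothetical: YY_CURRENT_BUFFER expands to 
a variable that is static in each generated file, so each .l file would 
need to export it.  Untested:

/* Hand the header scanner's partially consumed buffer to the plain
   text scanner instead of losing it. */
typedef struct yy_buffer_state *YY_BUFFER_STATE;

extern void text_plain__switch_to_buffer(YY_BUFFER_STATE b);
extern YY_BUFFER_STATE lexer_get_current_buffer(void); /* hypothetical
                                  accessor added to lexer_head.l */

void hand_off_to_body_scanner(void)
{
    /* Assumes a buffer created by one scanner has the same layout for
       another scanner generated by the same flex version. */
    YY_BUFFER_STATE leftover = lexer_get_current_buffer();
    text_plain__switch_to_buffer(leftover);
}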

If no one else works on this today, I will try and think about it tonight.

[njs at glock bogofilter3]$ ./time.123.new.sh
+ date
Sat Feb 22 04:51:40 EST 2003
+ '[' -x ./bogofilter/src/bogofilter_new ']'
+ '[' -d /tmp/bogofilter_work ']'
+ rm -rf /tmp/bogofilter_work
+ mkdir /tmp/bogofilter_work
+ echo 'ham 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/goodlist.db
+ echo 'spam 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/spamlist.db
+ echo ./bogofilter/src/bogofilter_new 2.txt
./bogofilter/src/bogofilter_new 2.txt
+ ./bogofilter/src/bogofilter_new -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    0m1.585s
user    0m1.539s
sys     0m0.033s
+ echo ''

+ echo ./bogofilter/src/bogofilter_new 3.txt
./bogofilter/src/bogofilter_new 3.txt
+ ./bogofilter/src/bogofilter_new -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    0m0.698s
user    0m0.684s
sys     0m0.006s
+ echo ''

+ echo ./bogofilter/src/bogofilter_new 4.txt
./bogofilter/src/bogofilter_new 4.txt
+ ./bogofilter/src/bogofilter_new -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    0m35.502s
user    0m34.486s
sys     0m0.139s
+ echo ''

+ '[' -x ./bogofilter/src/bogofilter_old ']'
+ '[' -d /tmp/bogofilter_work ']'
+ rm -rf /tmp/bogofilter_work
+ mkdir /tmp/bogofilter_work
+ echo 'ham 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/goodlist.db
+ echo 'spam 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/spamlist.db
+ echo ./bogofilter/src/bogofilter_old 2.txt
./bogofilter/src/bogofilter_old 2.txt
+ ./bogofilter/src/bogofilter_old -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    0m4.199s
user    0m4.105s
sys     0m0.047s
+ echo ''

+ echo ./bogofilter/src/bogofilter_old 3.txt
./bogofilter/src/bogofilter_old 3.txt
+ ./bogofilter/src/bogofilter_old -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    0m2.334s
user    0m2.277s
sys     0m0.021s
+ echo ''

+ echo ./bogofilter/src/bogofilter_old 4.txt
./bogofilter/src/bogofilter_old 4.txt
+ ./bogofilter/src/bogofilter_old -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    1m14.532s
user    1m13.369s
sys     0m0.176s
+ echo ''

+ '[' -x ./bogofilter/src/bogofilter ']'
+ '[' -d /tmp/bogofilter_work ']'
+ rm -rf /tmp/bogofilter_work
+ mkdir /tmp/bogofilter_work
+ echo 'ham 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/goodlist.db
+ echo 'spam 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/spamlist.db
+ echo ./bogofilter/src/bogofilter 2.txt
./bogofilter/src/bogofilter 2.txt
+ ./bogofilter/src/bogofilter -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    0m2.037s
user    0m2.002s
sys     0m0.021s
+ echo ''

+ echo ./bogofilter/src/bogofilter 3.txt
./bogofilter/src/bogofilter 3.txt
+ ./bogofilter/src/bogofilter -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    0m0.968s
user    0m0.945s
sys     0m0.016s
+ echo ''

+ echo ./bogofilter/src/bogofilter 4.txt
./bogofilter/src/bogofilter 4.txt
+ ./bogofilter/src/bogofilter -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
version=0.10.3.1.cvs.20030220

real    0m41.982s
user    0m40.061s
sys     0m0.102s
+ echo ''

[njs at glock bogofilter3]$


--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally 
to mean electronic messages designed to be read by an individual, and it 
can include Usenet, SMS, AIM, etc.  But if it is not all three of 
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term 
plays into the hands of the spammers, since it causes confusion, and 
spammers thrive on  confusion. Spam is not speech, it is an action, like 
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!


