Performance issues....and ugly news.
Nick Simicich
njs at scifi.squawk.com
Sat Feb 22 13:56:12 CET 2003
Summary: In order to do *any* optimization of the generated scanners, you
have to generate batch scanners. The scheme of multiple scanners has a
hidden dependency on interactive scanners. I will explain. There are
multiple ugly solutions.
-=-=-=-
David Relson sent me the files that he used to run his testing programs. I
built my new parser with my new options, the original parser with the
original options, and the original plain-text parser with only the new options.
I made three runs. One is reproduced below. These numbers are typical. To
regularize the environment, I built a new goodlist.db and spamlist.db
before each run with one word in each database. You can see the procedures
I used below.
Run 1 was done with my new parser. Run 2 was done with the original
parser using the original options. Run 3 was done with the original
parser, but with the new flex options.
user 0m34.486s - user time, my parser (no backing up) with the new
options.
user 1m13.369s - original user numbers on my platform (1 GHz P III, for
a baseline).
user 0m40.061s - user time, old parser, new options (specifying -CF).
The answer is that there are large savings available when the flex parser
is built with -CF. So, I said, well, what is the next step? Obviously,
building 0.10.3, the current stable version, with -CF. I was looking
forward to saving a bunch of cpu time on my personal machine, with just
this one quick parameter change! I changed the following lines in the
Makefile (crudely, there is obviously a better way to do it). (The LEX
variable seems to be unused).
lexer_head.c: lexer_head.l
	flex -CF -Plexer_ -o$@ $<

lexer_text_plain.c: lexer_text_plain.l
	flex -CF -Ptext_plain_ -o$@ $<

lexer_text_html.c: lexer_text_html.l
	flex -CF -Ptext_html_ -o$@ $<
Then I did a:
touch *.l
make && make check
make check fails: the t.lexer test does not pass. The problem is as follows:
[root@scifi tests]# cat inputs/t.lexer.inp
From nobody@nowhere.com
Subject: Stephen King's latest thriller! Free Shipping & Handling
Red Rabbit By Tom Clancy
Tom Clancy's back with the story behind his most popular
character - Jack Ryan.
Leadership By Rudolph Giuliani
The controversial mayor who guided New York City through the
biggest crisis in the city's history reflects on his time in
office and the lessons he learned there.
Subject: Rent all the DVDs you want -- Free Trial!
The Standard Program is a month-to-month subscription cancelable at
anytime. Click the "Your Account" button for cancellation
instructions. No refunds or credits for partial monthly subscription
periods. Please visit www.netflix.com for complete terms and
conditions, including length of free trial period. Netflix reserves
the right to change terms and conditions at any time. (c)2002 JunkFlix,
Inc.
[root@scifi tests]# diff -u ./outputs/t.lexer.out ./lexer.20030222/t.lexer.out
--- ./outputs/t.lexer.out Thu Jan 30 13:07:22 2003
+++ ./lexer.20030222/t.lexer.out Sat Feb 22 05:13:02 2003
@@ -13,7 +13,6 @@
character
city
city's
-clancy
clancy's
click
complete
@@ -56,8 +55,6 @@
please
popular
program
-rabbit
-red
reflects
refunds
rent
All of the missing tokens ("red", "rabbit", "clancy") come from a single
line: the first line of the body, immediately following the header. This
failure was on my RedHat 5.0 system, and I verified that it fails in
exactly the same manner on my RedHat 7.2 system.
I have tried tracing the code, and I do not see where the line is lost. I
see the lexer find the newline and do the newline processing, and the next
thing it does is work on the second line of the body.
I wonder if this is part of the issue that affects my code.
Dave, I want to toss this over the wall to you. The question is, "Can you
tell what the problem is when the lexers are built with -CF?"
My question was, "Which part of the -CF alterations to the lexer are
causing us problems?"
I tried building the lexer without -CF, and of course it worked.
I tried building it with -B, and it failed. The only change that -B makes
is to the lookahead: a batch scanner is less worried about how much text it
has absorbed, and always reads one character ahead. The normal lexer is
built with -I, which means "interactive"; an interactive lexer reads as
little text from the input stream as possible. (This will not always be
the case, but it usually is.)
So, what has happened? The man page says, regarding -I:
>`-I'
>instructs flex to generate an interactive scanner. An interactive scanner is
>one that only looks ahead to decide what token has been matched if it
>absolutely must. It turns out that always looking one extra character ahead,
>even if the scanner has already seen enough text to disambiguate the current
>token, is a bit faster than only looking ahead when necessary. But scanners
>that always look ahead give dreadful interactive performance; for example,
>when a user types a newline, it is not recognized as a newline token until
>they enter another token, which often means typing in another whole line.
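For reference, the same choice can be pinned down in the .l file itself,
independent of which flags the Makefile happens to pass; these option names
are from the flex manual (pick one or the other, of course):

```lex
/* Equivalent of -B: a batch scanner always reads one character ahead,
   which is a bit faster but gives dreadful interactive behavior. */
%option batch

/* ...or the equivalent of -I: read as little input as possible. */
%option interactive
```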
That is what is happening here. There are three scanners. The first one
has absorbed text beyond the token it matched. That text sits in its
buffer and will not be seen again until that scanner is next active, i.e.
until you are back looking at a header somewhere.
This is worse than it sounds. Any optimization of any of the scanners
requires, more or less as a first step, that the scanner be
non-interactive. No speedup of any sort can be done until this buffer
problem is solved.
I can't find anything in the documentation that says, "Tell me what is in
them buffers." And maybe give them back.
Can you? (anyone?)
This is probably at least part of the issue with my random errors.
I can see three possibilities. The first is to hack the scanner to somehow
move the partially eaten buffers from one queue to another.
The second is to extract the data from the partially eaten buffers when
switching scanners and to put it back into the input stream.
The third is to do all of the splitting of the input from outside the
scanner. You know that you have reached an empty line. You do not keep
feeding stuff to the header scanner. You feed it an end-of-file and force
it to jam.
You have to do all scanning for MIME separators in the input
routine. Again, when you reach the end of a section, you feed the scanner
an EOF and jam it.
And you have to scan for "From ".
This is not as expensive as you might think. I believe that the outer
scanner is already breaking things into lines. If you are processing
multiple messages, you have to check the first character for 'F'. If you
are running a MIME message with separators, you need to check for '-'. And
if you are in a header, you need to check for '\n'.
If you want to maintain multiple scanners, you need to do the above. The
alternative, which would not be that bad, would be to use start states and
combine all the scanners into one. The problem there is that the simpler
scanners (the plain text scanner definitely, and maybe the header scanner,
though that is less important) could otherwise be recoded as no-backing-up
scanners. I am fairly sure that the HTML scanner can't be coded that way.
We can see the performance gains: the worst case is the 73 CPU seconds to
parse the miserable large message. The optimization to a batch scanner
with expanded tables gets us to 40 seconds, about 55% of the original time
(a 45% saving).
Eliminating backing up as well gets us to 35 seconds. The backing up costs
another 14% of the 40 seconds (about 7% of the original 73).
So, we can probably make the scanner run about twice as fast just by
combining all of the lexers into one and moving the state changes into the
lexer, where they would be represented by lex start states so that the
right patterns are selected.
But if we keep the lexers separate, we can make the text lexer maybe
7%/14% faster (of the total, or of the already-improved number) by
eliminating backups... though we will lose some of that, because boundary
detection, "From " line detection, and header-end detection will have to
move to the higher-level routine that will need to do that work.
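The combined-scanner idea would look roughly like this; the state name and
the patterns are invented for illustration, not taken from the bogofilter
sources (the scanner would start in HDR via a BEGIN(HDR) at startup):

```lex
%x HDR
%%
<HDR>^[^:\n]+:      { /* header field name */ }
<HDR>^\n            { BEGIN(INITIAL); /* blank line ends the header */ }
<HDR>.|\n           { /* skip everything else in the header */ }
[A-Za-z][A-Za-z'.]* { /* body word token; a boundary rule here could
                         BEGIN(HDR) again for the next part */ }
.|\n                { /* skip */ }
```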
Simple?
Of course, if it is possible to move any partial buffers from one lexer to
the other, then all of the problems go away. We can use the current
structures....
I thought, "Why don't we have the different scanners use the same anchor
for the buffers?"
My head hurts and I have to stop thinking about this for a while.
YY_CURRENT_BUFFER holds the current buffer. It is a YY_BUFFER_STATE.
When you get the newline, maybe you stash YY_CURRENT_BUFFER somewhere,
then pass it to the other scanner with yy_switch_to_buffer in that
scanner, and so forth.
At first I thought that the buffer would be local to the scanner. I do not
think it is.
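To convince myself the idea is sound, here is a toy model of the handoff in
plain C. This is not the flex API; it just mimics what passing a
YY_BUFFER_STATE from one scanner to another via yy_switch_to_buffer would
accomplish: the second consumer adopts the first one's buffer, position and
unread tail included, so nothing is lost.

```c
#include <stddef.h>

/* Toy stand-in for a flex buffer state: backing text plus read position. */
struct buf_state {
    const char *text;   /* backing storage */
    size_t pos, len;    /* bytes consumed so far / total bytes */
};

/* How many unread bytes would be stranded if we just dropped the buffer. */
static size_t unread(const struct buf_state *b) { return b->len - b->pos; }

/* Consume one character, or return -1 at end of buffer. */
static int next_char(struct buf_state *b)
{
    return b->pos < b->len ? (unsigned char)b->text[b->pos++] : -1;
}

/* The "switch": the new scanner adopts the old state wholesale, so the
 * partially eaten tail travels with it instead of being lost. */
static struct buf_state switch_to_buffer(struct buf_state old)
{
    return old;
}
```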
If no one else works on this today, I will try and think about it tonight.
[njs@glock bogofilter3]$ ./time.123.new.sh
+ date
Sat Feb 22 04:51:40 EST 2003
+ '[' -x ./bogofilter/src/bogofilter_new ']'
+ '[' -d /tmp/bogofilter_work ']'
+ rm -rf /tmp/bogofilter_work
+ mkdir /tmp/bogofilter_work
+ echo 'ham 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/goodlist.db
+ echo 'spam 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/spamlist.db
+ echo ./bogofilter/src/bogofilter_new 2.txt
./bogofilter/src/bogofilter_new 2.txt
+ ./bogofilter/src/bogofilter_new -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 0m1.585s
user 0m1.539s
sys 0m0.033s
+ echo ''
+ echo ./bogofilter/src/bogofilter_new 3.txt
./bogofilter/src/bogofilter_new 3.txt
+ ./bogofilter/src/bogofilter_new -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 0m0.698s
user 0m0.684s
sys 0m0.006s
+ echo ''
+ echo ./bogofilter/src/bogofilter_new 4.txt
./bogofilter/src/bogofilter_new 4.txt
+ ./bogofilter/src/bogofilter_new -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 0m35.502s
user 0m34.486s
sys 0m0.139s
+ echo ''
+ '[' -x ./bogofilter/src/bogofilter_old ']'
+ '[' -d /tmp/bogofilter_work ']'
+ rm -rf /tmp/bogofilter_work
+ mkdir /tmp/bogofilter_work
+ echo 'ham 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/goodlist.db
+ echo 'spam 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/spamlist.db
+ echo ./bogofilter/src/bogofilter_old 2.txt
./bogofilter/src/bogofilter_old 2.txt
+ ./bogofilter/src/bogofilter_old -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 0m4.199s
user 0m4.105s
sys 0m0.047s
+ echo ''
+ echo ./bogofilter/src/bogofilter_old 3.txt
./bogofilter/src/bogofilter_old 3.txt
+ ./bogofilter/src/bogofilter_old -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 0m2.334s
user 0m2.277s
sys 0m0.021s
+ echo ''
+ echo ./bogofilter/src/bogofilter_old 4.txt
./bogofilter/src/bogofilter_old 4.txt
+ ./bogofilter/src/bogofilter_old -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 1m14.532s
user 1m13.369s
sys 0m0.176s
+ echo ''
+ '[' -x ./bogofilter/src/bogofilter ']'
+ '[' -d /tmp/bogofilter_work ']'
+ rm -rf /tmp/bogofilter_work
+ mkdir /tmp/bogofilter_work
+ echo 'ham 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/goodlist.db
+ echo 'spam 1 20030222'
+ ./bogofilter/src/bogoutil -l /tmp/bogofilter_work/spamlist.db
+ echo ./bogofilter/src/bogofilter 2.txt
./bogofilter/src/bogofilter 2.txt
+ ./bogofilter/src/bogofilter -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 0m2.037s
user 0m2.002s
sys 0m0.021s
+ echo ''
+ echo ./bogofilter/src/bogofilter 3.txt
./bogofilter/src/bogofilter 3.txt
+ ./bogofilter/src/bogofilter -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 0m0.968s
user 0m0.945s
sys 0m0.016s
+ echo ''
+ echo ./bogofilter/src/bogofilter 4.txt
./bogofilter/src/bogofilter 4.txt
+ ./bogofilter/src/bogofilter -v -C -d /tmp/bogofilter_work
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000,
version=0.10.3.1.cvs.20030220
real 0m41.982s
user 0m40.061s
sys 0m0.102s
+ echo ''
[njs@glock bogofilter3]$
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!