Performance issues....and ugly news.

David Relson relson at osagesoftware.com
Sun Feb 23 14:20:53 CET 2003


Nick,

Sounds like you're beginning to make significant progress.  Good deal :-)

At 10:30 PM 2/22/03, Nick Simicich wrote:
>At 02:38 PM 2003-02-22 -0500, David Relson wrote:

...[snip]...

>>Idea 1 - it may be time to change the structure of the lexers.  The 
>>current trio may no longer be suitable.  Perhaps what bogofilter needs is 
>>a "master" parser to control what is happening.  It would handle header 
>>fields, mime boundary lines, etc and have a "body" mode.  When in body 
>>mode, it would decode incoming text (if appropriate) and pass it to the 
>>proper body parser (html or plain text).  When the main parser sees a 
>>mime boundary or the beginning of a new message, it would pass an EOF to 
>>the body parser and then process the following header.
>
>No, this could not be a lex parser; it would have to be C code.

Right.  In token.c, get_token() is currently responsible for calling the 
appropriate lexer.  The idea has been to enhance the C code so that it calls 
the master parser while processing a header and, for the body, receives full 
lines from the master parser so they can be passed to the appropriate body 
parser.  This seems to call for intricately careful orchestration of lexer 
calls and text handling, and problems are anticipated for this code - if 
it's ever written.


>>Idea 2 - the current lexer works adequately for the vast majority of 
>>messages.  What it doesn't handle satisfactorily is humongously long 
>>strings of characters which match a rule, hence might be a valid 
>>token.  Since we know that bogofilter is going to ignore strings longer 
>>than MAXTOKENLEN, we can include code to discard these strings.  Rather than 
>>let the lexer use lots of time to match something we don't want, we 
>>should do the discard sooner (rather than later).
>
>It is possible to change the lexer to do this, I was going to play with 
>this a little tonight.  However, even for non-humongously long strings, 
>the point is that we are leaving 50% of the lexer performance on the 
>table.  Check the relative performances of the other test cases.  Also, as 
>I mentioned, this is essentially broken.

My quick and simple fix, i.e. check_alphanum(), works, doesn't break 
anything, and gives good times.
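
For the record, the idea behind check_alphanum() is simply to notice an 
overlong run of token characters before the lexer spends time matching it, 
and to skip past it.  The fragment below is only a sketch of that idea - 
the signature and the MAXTOKENLEN value are illustrative, not the code as 
committed:

    #include <stddef.h>
    #include <ctype.h>

    #define MAXTOKENLEN 30    /* illustrative value only */

    /* Sketch of the check_alphanum() idea: if the buffer starts with a run
     * of alphanumeric characters longer than MAXTOKENLEN, report the run's
     * length so the caller can skip it instead of feeding it to flex. */
    static size_t check_alphanum(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len && isalnum(buf[i]))
            i++;
        return (i > MAXTOKENLEN) ? i : 0;   /* 0 means "let the lexer have it" */
    }

The caller just advances past the returned count before handing the rest of 
the buffer to the lexer.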

>>True, having a maximum acceptable token length will cause bogofilter to 
>>ignore very long valid words.  However we are doing that anyway and 
>>nobody has complained.  Very long words are unusual so if bogofilter 
>>doesn't rate them, it loses little in its efforts to classify spam.
>>
>>Question:  Can a maximum length be built into the lexer rules?  What if 
>>we used 100 characters as the maximum?
>
>If the token rule is changed to use a
><INITIAL>{char1}{char2}{0,MAXTOKEN-2}{char3}
>as a repeat for the second character instead of a plus and then another 
>token is used that is
><INITIAL>{char1}{char2}{MAXTOKEN-1}
>   (We will have to use the numbers and not the names, I think), then the 
> end result may be that we could use that second token to switch states 
> and not pass a token back.

Am I correct in thinking that using a max length like this will result in a 
single token from the tail end of the overly long string?  I think we can 
accept that artifact, though it'd be better to discard the whole thing.

>We could then use <OVERSIZESTATE>{char2}{1,8192} and if those match, we 
>could then simply
>
><INITIAL>{char1}{char2}{MAXTOKEN-1}    {BEGIN(OVERSIZESTATE);}
>
># The following avoids allowing the tokens to get "too big", since we just 
>pitch them anyway.
>
><OVERSIZESTATE>{char2}{1,YY_MAX}   ;
><OVERSIZESTATE>{[^char2]}   {yyless(1); BEGIN(0);}
>
>Char1, char2, and char3 are used to represent, roughly, the current first, 
>second, and third characters as allowed in extant tokens.
>
>All other rules would be tagged "INITIAL".  This has the advantage of 
>never passing a large token back to the upper level code.
>
>But part of the point is not to fix the flex scanner for this one whacky 
>state -- it is to fix it for all things, and optimizing it does 
>that.  This is probably a good fix no matter what (moving more of the 
>discrimination into the lexer, to a lower level).
>
>But the final point is that the buffer passing is what probably should be 
>done.  I believe that this is also why my code was not working properly 
>with finding "From " lines --- I had already eaten them in the plain text 
>lexer.

What would be the consequences of a single lexer that takes advantage of 
flex's "start condition" capability?  Can it be done?  Would it not avoid 
the need for buffer passing?  Is it not a simpler solution than 
coordinating multiple parsers?

>It is really essential that buffer handling be fixed.  I will get yet 
>another copy of the code from cvs and try to hack it to pass buffers 
>between the lexers.  I thought about it and I think it can be done.  The 
>active buffer, full or empty, needs to be extracted from the lexer you are 
>leaving and installed into the lexer you are going to.  There is already 
>code that does that, I will add it to that code and add stuff to the 
>lexers to extract and install the buffers.

'Tis no big deal to get a single module from cvs - I do it 
frequently.  There's no need to download a full copy of bogofilter.  I do 
it with a command pair like:

         mv lexer.c lexer.c.bad
         cvs update lexer.c
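
Back to the buffer hand-off itself: flex's buffer machinery 
(YY_CURRENT_BUFFER, yy_switch_to_buffer() and friends) looks like it gives 
us the extract/install step, provided each lexer exports a couple of small 
helpers.  Very roughly - this is only a sketch of the idea, with invented 
helper names, not code that exists:

    /* Sketch only - the helper names are invented; nothing here exists yet.
     * The idea: each flex-generated lexer gets a small pair of helpers in
     * its .l file (where YY_CURRENT_BUFFER and yy_switch_to_buffer() are
     * visible) to hand its active buffer out and to accept one in. */

    typedef struct yy_buffer_state *YY_BUFFER_STATE;  /* flex's buffer handle */

    extern YY_BUFFER_STATE head_export_buffer(void);              /* hypothetical */
    extern void            head_import_buffer(YY_BUFFER_STATE b); /* hypothetical */
    extern YY_BUFFER_STATE text_export_buffer(void);              /* hypothetical */
    extern void            text_import_buffer(YY_BUFFER_STATE b); /* hypothetical */

    /* called from the code that already decides which lexer is active */
    void switch_head_to_text(void)
    {
        /* extract the active buffer, full or empty, from the lexer we are
         * leaving and install it in the one we are entering */
        YY_BUFFER_STATE b = head_export_buffer();
        text_import_buffer(b);
    }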

>As a point, I built a lexer that does this - it handles the basic tests 
>but not the advanced tests.  I ran it against 4.txt:
>
>+ echo /home/njs/prod_bogofilter/bogofilter-0.10.3.1/bogofilter 4.txt
>/home/njs/prod_bogofilter/bogofilter-0.10.3.1/bogofilter 4.txt
>+ /home/njs/prod_bogofilter/bogofilter-0.10.3.1/bogofilter -v -C -d 
>/tmp/bogofilter_work
>X-Bogosity: No, tests=bogofilter, spamicity=0.415000, version=0.10.3.1
>
>real    0m0.107s
>user    0m0.086s
>sys     0m0.012s
>+ echo ''
>
>Repeat run: user    0m0.092s
>
>That's right:  This 4.txt.  Same file, same run, same spamicity, extant 
>lexer, a few seconds later:
>
>+ echo ./bogofilter/src/bogofilter_old 4.txt
>./bogofilter/src/bogofilter_old 4.txt
>+ ./bogofilter/src/bogofilter_old -v -C -d /tmp/bogofilter_work
>X-Bogosity: Unsure, tests=bogofilter, spamicity=0.415000, 
>version=0.10.3.1.cvs.20030
>220
>
>real    1m14.109s
>user    1m13.232s
>sys     0m0.148s
>+ echo ''
>
>What is this?  It might well be "Building a lexer that handles the odd 
>case well", or it might be "building a pathological lexer to handle a 
>pathological case".  But the lexer I built throws away all that whacky 
>stuff in 0.11% of the time that first lexer did, with no 
>optimizations.  All the changes are in this part of the lexer.  I have yet 
>to figure out exactly why it
>
>%option align nounput noyywrap noreject 8bit caseless
>
>%S OVERSIZE
>
>MIME_BOUNDARY   [0-9a-zA-Z\'()+_,-./:=?]*
>
>UINT8           ([01]?[0-9]?[0-9]|2([0-4][0-9]|5[0-5]))
>IPADDR          {UINT8}\.{UINT8}\.{UINT8}\.{UINT8}
>     /* MAXTOKENLEN-1*/
>TOKEN          [^[:blank:][:cntrl:][:digit:][:punct:]][^\][:blank:]<>;=():&%$#@!+|/\\{}^\"\?\*,[:cntrl:]\[]{1,28}[^[:blank:][:punct:][:cntrl:]]
>MAXTOKEN       [^[:blank:][:cntrl:][:digit:][:punct:]][^\][:blank:]<>;=():&%$#@!+|/\\{}^\"\?\*,[:cntrl:]\[]{29}
>BIGTOKEN        [^\][:blank:]<>;=():&%$#@!+|/\\{}^\"\?\*,[:cntrl:]\[]{1,8191}
>%%
>
><INITIAL>^From\                                 { return (msg_header ? FROM : TOKEN); }
><INITIAL>^--{MIME_BOUNDARY}(--)?$               { return (got_mime_boundary((byte *)yytext, yyleng) ? BOUNDARY : TOKEN); }
>
><INITIAL>{IPADDR}                               { return(IPADDR);}
><INITIAL>{TOKEN}                                { return(TOKEN); }
><INITIAL>.                                      ;
><INITIAL>^\n                                    { got_newline(); return(EMPTY);}
><INITIAL>\n                                     ;
>
><INITIAL>{MAXTOKEN}                             { BEGIN(OVERSIZE); }
><OVERSIZE>{BIGTOKEN}                    ;
><OVERSIZE>(.|\n)                                { yyless(yyleng); BEGIN(0); }
>
>Now, this is a broken lexer:  The problem is that it will split tokens if 
>the wrong character is mid-token, and you also have to ensure that the 
>character following the size-limited token is a genuine separate 
>token.  Making some of those changes kills the pathological performance of 
>this parser - you end up with times around 10 seconds.  Playing some more 
>gets you back to about 8 seconds.

I don't quite get what you mean by "wrong character in mid-token".  Can you 
give an example?

What does this lexer do with a medium size token, i.e. one between 29 and 
8191 characters?





