Performance issues....and ugly news.

Matthias Andree matthias.andree at gmx.de
Sat Feb 22 22:01:00 CET 2003


David Relson <relson at osagesoftware.com> writes:

> If I understand what's going on, option "-CF" produces a batch parser
> (instead of an interactive parser), which will be faster.  Currently
> bogofilter uses one parser for message headers (including those of mime
> parts), and a separate parser for plain text and html in the message body
> (or mime part body).  The read-ahead needed by the batch parser
> conflicts with using separate header and body parsers, because the
> header parser reads ahead and gets the first line that the body parser
> should see.  So the body parser doesn't process that line and its
> tokens are lost to bogofilter.

This is the issue I've also thought about for the...

> Idea 1 - it may be time to change the structure of the lexers.  The
> current trio may no longer be suitable.  Perhaps what bogofilter needs
> is a "master" parser to control what is happening.  It would handle
> header fields, mime boundary lines, etc and have a "body" mode.  When in
> body mode, it would decode incoming text (if appropriate) and pass it to
> the proper body parser (html or plain text).  When the main parser sees
> a mime boundary or the beginning of a new message, it would pass an EOF
> to the body parser and then process the following header.

...similar idea I mentioned before. The problem I currently see is that
the master parser would have to "push" data down, but lexers usually
pull. I wonder if the slave's yyinput can sanely read from the master's
buffer -- and how the scheduling between the two would happen without
locking. The only way I can see is the slave lexer instructing the
master to read data and decode it -- not exactly the compartmentalization
that we'd like to achieve with the three parsers, and "master" would
then be a misnomer. "Driver" might be more appropriate.
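
To make the driver idea a bit more concrete, here's a rough sketch --
not bogofilter code; body_feed() and the buffer names are made up -- of
how the driver could decode body text into a buffer it owns and let the
body lexer's YY_INPUT pull from it. An empty buffer then looks like EOF
to flex, which is how the driver would signal a mime boundary or the
start of the next header:

#include <string.h>

/* Driver side: hand one decoded chunk to the body lexer.
 * body_feed() is a hypothetical name, not an existing bogofilter API. */
static const char *body_buf;   /* decoded text owned by the driver */
static size_t body_len;        /* bytes remaining in body_buf      */

void body_feed(const char *buf, size_t len)
{
    body_buf = buf;
    body_len = len;
}

/* Lexer side (definitions section of the body .l file): redefine
 * YY_INPUT so the slave pulls from the driver's buffer instead of
 * reading the mailbox itself.  result == 0 is EOF as far as flex
 * is concerned. */
#define YY_INPUT(dst, result, max_size)                          \
    do {                                                         \
        size_t n = body_len < (size_t)(max_size)                 \
                       ? body_len : (size_t)(max_size);          \
        memcpy((dst), body_buf, n);                              \
        body_buf += n;                                           \
        body_len -= n;                                           \
        (result) = n;                                            \
    } while (0)

That way the lexers keep pulling, as they like to, while the driver
stays in charge of decoding and of deciding where one body ends.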

> Idea 2 - the current lexer works adequately for the vast majority of
> messages.  What it doesn't handle satisfactorily is humongously long
> strings of characters which match a rule, hence might be a valid
> token.

I wonder if adding {1,30} or something similar helps. It chops long
strings into many small strings, though. Tracking state (i.e. not
returning TOKEN if the previous TOKEN wasn't separated by a delimiter)
might fix that.
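
For illustration, a rule along these lines might do it; the character
class, the TOKEN value and the flag name are made up here, not what
bogofilter's lexer actually uses:

%{
static int in_long_run = 0;   /* nonzero while inside an over-long run */
%}
WORD    [A-Za-z0-9'.$-]{1,30}

%%
{WORD}        { if (!in_long_run) {
                    in_long_run = 1;    /* first fragment of the run */
                    return TOKEN;
                }
                /* later fragments, no delimiter in between: swallow */
              }
[ \t\r\n]+    { in_long_run = 0;        /* delimiter seen, reset */ }
.             { in_long_run = 0; }
%%

The first 30-character slice of an over-long run would still come back
as a token, but the rest would be swallowed until a delimiter resets
the flag.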

-- 
Matthias Andree



