Performance issues....and ugly news.
Matthias Andree
matthias.andree at gmx.de
Sat Feb 22 22:01:00 CET 2003
David Relson <relson at osagesoftware.com> writes:
> If I understand what's going on, option "-CF" produces a batch parser
> (instead of an interactive parser), which will be faster. Currently
> bogofilter uses one parser for message headers (including those of MIME
> parts) and a separate parser for plain text and HTML in the message body
> (or MIME part body). The read-ahead needed by the batch parser
> conflicts with using separate header and body parsers, because the
> header parser reads ahead and consumes the first line that the body
> parser should see. The body parser never processes that line, so its
> tokens are lost to bogofilter.
This is the issue I've also thought about for the...
> Idea 1 - it may be time to change the structure of the lexers. The
> current trio may no longer be suitable. Perhaps what bogofilter needs
> is a "master" parser to control what is happening. It would handle
> header fields, mime boundary lines, etc and have a "body" mode. When in
> body mode, it would decode incoming text (if appropriate) and pass it to
> the proper body parser (html or plain text). When the main parser sees
> a mime boundary or the beginning of a new message, it would pass an EOF
> to the body parser and then process the following header.
...similar idea I mentioned before. The problem I currently see is that
the master parser would have to "push" data down, while lexers usually
pull. I wonder whether the slave's yyinput can sanely read from the
master's buffer -- and how the scheduling would happen and stay
lock-free. The only arrangement I can see is the slave lexer instructing
the master to read and decode data -- not exactly the
compartmentalization we'd like to achieve with the three parsers, and
"master" would then be a misnomer; "driver" might be more appropriate.
> Idea 2 - the current lexer works adequately for the vast majority of
> messages. What it doesn't handle satisfactorily is humongously long
> strings of characters which match a rule, hence might be a valid
> token.
I wonder if adding {1,30} or something would help. It would chop long
strings into many small tokens, though. Tracking the state (i.e. not
returning TOKEN if the previous TOKEN wasn't separated by a delimiter)
might fix that.
--
Matthias Andree