Performance issues....and ugly news.
David Relson
relson at osagesoftware.com
Sat Feb 22 20:38:34 CET 2003
Nick,
If I understand what's going on, option "-CF" produces a batch parser
(instead of an interactive parser) which will be faster. Currently
bogofilter uses one parser for message headers (including those of mime
parts), and separate parsers for plain text and html in the message body (or
mime part body). The read-ahead needed by the batch parser conflicts with
using separate header and body parsers because the header parser reads ahead
and gets the first line that the body parser should see. So the body parser
never processes that line and its tokens are lost to bogofilter.
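
To make the conflict concrete, here is a toy model in C. It is only a sketch and not how flex actually buffers its input: the line_source type and both function names are made up for illustration, and the read-ahead is simplified to a single line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared input: an array of lines and a read position. */
struct line_source {
    const char **lines;
    size_t count;
    size_t next;
};

/* Hand out the next unread line, or NULL at end of input. */
const char *next_line(struct line_source *src)
{
    return src->next < src->count ? src->lines[src->next++] : NULL;
}

/*
 * Toy header parser with one line of read-ahead. It always fetches the
 * following line before it examines the current one, so by the time it
 * sees the blank line that ends the headers it has already pulled the
 * first body line out of the shared source.
 * Returns the swallowed body line (NULL if there was none).
 */
const char *parse_headers_readahead(struct line_source *src, size_t *header_count)
{
    const char *cur, *ahead;

    assert(src != NULL && header_count != NULL);
    *header_count = 0;
    cur = next_line(src);
    while (cur != NULL) {
        ahead = next_line(src);        /* read-ahead happens first */
        if (strcmp(cur, "") == 0)
            return ahead;              /* headers done; 'ahead' is body text */
        (*header_count)++;
        cur = ahead;
    }
    return NULL;
}
```

If a body parser now calls next_line() on the same source, it starts at the second body line. The first one is sitting in the header parser's read-ahead and never gets tokenized.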
Idea 1 - it may be time to change the structure of the lexers. The current
trio may no longer be suitable. Perhaps what bogofilter needs is a
"master" parser to control what is happening. It would handle header
fields, mime boundary lines, etc., and have a "body" mode. When in body
mode, it would decode incoming text (if appropriate) and pass it to the
proper body parser (html or plain text). When the main parser sees a mime
boundary or the beginning of a new message, it would pass an EOF to the
body parser and then process the following header.
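
A minimal sketch of what such a master parser's control logic might look like, in C. Everything here is an assumption for illustration (the enum names, struct master, master_feed), not existing bogofilter code. It works on whole lines without the trailing newline, leaves out the decoding step, and treats any line starting with the boundary string (including a closing boundary) as the start of a new header block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum mode { MODE_HEADER, MODE_BODY };

/* What the master parser decides to do with one input line. */
enum action {
    TO_HEADER_PARSER,   /* line is a header field */
    SWITCH_TO_BODY,     /* blank line: headers are over */
    TO_BODY_PARSER,     /* line goes to the html or plain text parser */
    BODY_EOF            /* tell the body parser its input has ended */
};

struct master {
    enum mode mode;
    const char *boundary;   /* e.g. "--XYZ", or NULL if not multipart */
};

static int starts_with(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Classify one line and update the mode. No read-ahead is needed,
 * because the master sees every line before any sub-parser does. */
enum action master_feed(struct master *m, const char *line)
{
    assert(m != NULL && line != NULL);

    if (m->mode == MODE_HEADER) {
        if (line[0] == '\0') {
            m->mode = MODE_BODY;
            return SWITCH_TO_BODY;
        }
        return TO_HEADER_PARSER;
    }

    /* Body mode: a mime boundary or an mbox "From " line ends the body. */
    if (starts_with(line, "From ") ||
        (m->boundary != NULL && starts_with(line, m->boundary))) {
        m->mode = MODE_HEADER;
        return BODY_EOF;
    }
    return TO_BODY_PARSER;
}
```

The caller would loop over input lines, call master_feed() on each, and dispatch on the result. Since only the master reads the input, the body parser can be a batch parser without stealing lines from anyone.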
Idea 2 - the current lexer works adequately for the vast majority of
messages. What it doesn't handle satisfactorily is humongously long
strings of characters which match a rule, hence might be a valid
token. Since we know that bogofilter is going to ignore strings longer
than MAXTOKENLEN, we can include code to discard these strings. Rather than
let the lexer use lots of time to match something we don't want, we should
do the discard sooner (rather than later).
True, having a maximum acceptable token length will cause bogofilter to
ignore very long valid words. However, we are doing that anyway and nobody
has complained. Very long words are unusual so if bogofilter doesn't rate
them, it loses little in its efforts to classify spam.
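
One way the early discard could be done is a small filter that runs before the lexer sees the text. The following C sketch is an assumption, not existing bogofilter code: the function name discard_long_runs is made up, and it treats any run of non-whitespace characters as a candidate token, which is cruder than the real lexer rules.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src to dst, dropping every run of non-whitespace characters that
 * is longer than max_len. Whitespace is copied unchanged. dst must have
 * room for strlen(src) + 1 bytes. Returns the number of bytes written,
 * not counting the terminating NUL.
 */
size_t discard_long_runs(const char *src, char *dst, size_t max_len)
{
    const char *p = src;
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {
            dst[out++] = *p++;
            continue;
        }

        /* Measure the run without copying anything yet. */
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;

        size_t run = (size_t)(p - start);
        if (run <= max_len) {          /* short enough: keep it */
            memcpy(dst + out, start, run);
            out += run;
        }                              /* too long: silently dropped */
    }
    dst[out] = '\0';
    return out;
}
```

Calling it with max_len set to MAXTOKENLEN means the lexer never has to grind through a string that would be thrown away afterwards. The scan is a single pass over the input, so the cost is linear in the message size.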
Question: Can a maximum length be built into the lexer rules? What if we
used 100 characters as the maximum?
More later ...
David