Performance issues ... and ugly news.

David Relson relson at osagesoftware.com
Sat Feb 22 20:38:34 CET 2003


Nick,

If I understand what's going on, flex's "-CF" option produces a batch parser 
(instead of an interactive parser), which will be faster.  Currently 
bogofilter uses one parser for message headers (including those of mime 
parts) and separate parsers for plain text and html in the message body (or 
mime part body).  The read-ahead needed by the batch parser conflicts with 
using separate header and body parsers: the header parser reads ahead and 
consumes the first line that the body parser should see.  The body parser 
never processes that line, so its tokens are lost to bogofilter.

Idea 1 - it may be time to change the structure of the lexers.  The current 
trio may no longer be suitable.  Perhaps what bogofilter needs is a 
"master" parser to control what is happening.  It would handle header 
fields, mime boundary lines, etc., and have a "body" mode.  When in body 
mode, it would decode incoming text (if appropriate) and pass it to the 
proper body parser (html or plain text).  When the master parser sees a mime 
boundary or the beginning of a new message, it would pass an EOF to the 
body parser and then process the following header.  A rough sketch of such 
a control loop is below.
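
For concreteness, here's a rough sketch of what that master control loop 
might look like.  Every name in it (master_parse_line, is_boundary, 
decode_text, and so on) is invented for this sketch; none of them exist in 
the current sources:

/* Hypothetical sketch of Idea 1 -- all names are invented for
 * illustration and do not exist in the bogofilter sources. */

#include <stddef.h>

typedef enum { MODE_HEADER, MODE_BODY } lexer_mode_t;

/* Assumed helpers: boundary detection, decoding, and the existing
 * header and body lexers, wrapped behind these prototypes. */
extern int         is_boundary(const char *line, size_t len);
extern const char *decode_text(const char *line, size_t *len);
extern void        header_parse(const char *line, size_t len);
extern void        body_parse(const char *text, size_t len);
extern void        body_parser_eof(void);

static lexer_mode_t mode = MODE_HEADER;

void master_parse_line(const char *line, size_t len)
{
    if (is_boundary(line, len)) {
        /* mime boundary or start of a new message: terminate the
         * body parser and switch back to header mode. */
        body_parser_eof();
        mode = MODE_HEADER;
        return;
    }

    if (mode == MODE_HEADER) {
        if (len == 0)
            mode = MODE_BODY;          /* blank line ends the header */
        else
            header_parse(line, len);   /* header fields, mime declarations */
    } else {
        /* Decode (qp/base64) if appropriate, then hand the text to
         * the proper body parser (html or plain text). */
        const char *text = decode_text(line, &len);
        body_parse(text, len);
    }
}

The point is that only the master parser ever reads input, so no lexer's 
read-ahead can steal a line that belongs to another parser.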

Idea 2 - the current lexer works adequately for the vast majority of 
messages.  What it doesn't handle satisfactorily is humongously long 
strings of characters which match a rule and hence might be a valid 
token.  Since we know that bogofilter is going to ignore strings longer 
than MAXTOKENLEN, we can include code to discard these strings.  Rather 
than let the lexer spend lots of time matching something we don't want, we 
should do the discard sooner rather than later; one possible rule is 
sketched below.
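
Something along these lines might do it.  This is only a sketch, not the 
real lexer: TOKENCHAR and the 30-character limit stand in for the actual 
character class and MAXTOKENLEN.

%{
/* Sketch only -- placeholder names, not bogofilter's actual lexer.
 * Assume MAXTOKENLEN is 30 for this example. */
#define TOKEN 1
%}

%option noyywrap

TOKENCHAR   [A-Za-z0-9]

%%

{TOKENCHAR}{31,}    { /* longer than MAXTOKENLEN: discard the whole run
                       * here, before the token rules spend time on it.
                       * (flex needs literal counts in patterns, so the
                       * 31 can't come from the macro.) */ }

{TOKENCHAR}{1,30}   { return TOKEN; }

.|\n                { /* anything else: ignore */ }

Because flex prefers the longest match, the discard rule wins whenever a 
run exceeds the limit, so the whole run should be consumed in one linear 
pass instead of being rescanned.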

True, having a maximum acceptable token length will cause bogofilter to 
ignore very long valid words.  However, we are doing that anyway and nobody 
has complained.  Very long words are unusual, so if bogofilter doesn't rate 
them it loses little in its ability to classify spam.

Question:  Can a maximum length be built into the lexer rules?  What if we 
used 100 characters as the maximum?
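
I believe it can.  flex accepts bounded repetition counts in its patterns, 
so a 100-character cap can be written directly into a rule.  A minimal 
sketch, again with a placeholder character class and token code:

%{
#define TOKEN 1   /* placeholder token code for this sketch */
%}

%option noyywrap

TOKENCHAR   [A-Za-z0-9]

%%

{TOKENCHAR}{1,100}   { return TOKEN;   /* accept tokens up to 100 chars */ }
{TOKENCHAR}{101,}    { /* over the cap: discard the whole run */ }

Note that the discard rule is still needed alongside the bound: flex takes 
the longest match, so with only the {1,100} rule a 250-character run would 
come back as tokens of 100, 100 and 50 characters instead of being ignored.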

More later ...

David
