0.10.1.1 status report [was: the recent ^From issues]

Mon Jan 27 14:11:12 CET 2003

At 07:29 AM 1/27/03, Matthias Andree wrote:

>Hi,
>
>I have been unable to track bogofilter development last week, but I
>gather as much:
>
>* The "From " line is not only detected in the "header" lexer that is
>   to figure the structure, but also in the text/plain and text/html
>   lexer.
>
>My original idea about splitting the lexers was to separate functions,
>and it seems the current implementation misses the point.

Sorry, but I disagree.  The lexers _do_ separate functions, as far as they can.

There _are_ a few constructs common to all of them.  lexer_head.l and only 
lexer_head.l recognizes MIME-Version, Content-*, Date:.*, boundary=, 
etc.  All three of them recognize mime-boundaries, ip-addrs, empty lines, 
and ^From.  The rules for mime-boundaries and ^From are necessary because 
they signal the end of a message body section.
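
To make the shared boundary rule concrete, here is a minimal sketch in C of
the check each lexer needs. It is not the actual flex rule from any of the
lexer_*.l files; the names classify_boundary and boundary_kind are
hypothetical, and the boundary string is assumed to have been captured
earlier from the boundary= parameter. It follows the usual MIME convention
that a part delimiter is "--" plus the boundary, and the closing delimiter
adds a trailing "--".

```c
#include <string.h>

/* Hypothetical result codes, not taken from the bogofilter sources. */
enum boundary_kind { NOT_BOUNDARY, PART_BOUNDARY, FINAL_BOUNDARY };

/* Classify one raw line against a known MIME boundary string. */
static enum boundary_kind classify_boundary(const char *line,
                                            const char *boundary)
{
    size_t blen = strlen(boundary);
    enum boundary_kind kind = PART_BOUNDARY;
    const char *rest;

    /* A delimiter line starts with "--" followed by the boundary. */
    if (blen == 0 || strncmp(line, "--", 2) != 0)
        return NOT_BOUNDARY;
    if (strncmp(line + 2, boundary, blen) != 0)
        return NOT_BOUNDARY;

    rest = line + 2 + blen;

    /* A trailing "--" marks the last delimiter of the multipart body. */
    if (strncmp(rest, "--", 2) == 0) {
        kind = FINAL_BOUNDARY;
        rest += 2;
    }

    /* Only whitespace and the line ending may follow. */
    rest += strspn(rest, " \t\r\n");
    return *rest == '\0' ? kind : NOT_BOUNDARY;
}
```

Either non-zero result ends the current body section, which is why a body
lexer cannot simply ignore these lines.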

Yesterday I released a patch that recognizes ^From at the lowest level, in
the raw text, and executes the end-of-message processing there.  With that
recognition the lexer rule for ^From only needs to evaluate "return
(msg_header ? FROM : TOKEN)".  The changes have no impact on the regression
tests.  The message counts from "grep ^From" and "bogofilter -s" are the
same for all of the approximately 200 mbox files on my development machine.
Counts are also the same for all the problem messages reported last week by
Matt Armstrong, Ronald Coleman, Chris Wilkes, and Greg Louis.
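
For readers who want to see the shape of that change, the following is a
rough sketch in C, not the patch itself. The names reader_state,
on_raw_line, is_from_line and classify_from are hypothetical; only
msg_header, FROM and TOKEN come from the rule quoted above. It assumes an
mbox separator is a line starting with "From " (with the trailing space)
and that the first empty line ends the header.

```c
#include <stdbool.h>
#include <string.h>

/* Token classes named after the FROM/TOKEN in the lexer rule. */
enum token_class { TOKEN, FROM };

/* Hypothetical reader state kept outside the lexers. */
struct reader_state {
    bool msg_header;     /* true while inside a message header */
    unsigned msg_count;  /* messages seen so far */
};

/* True when a raw line is an mbox separator ("From " at column 0). */
static bool is_from_line(const char *line)
{
    return strncmp(line, "From ", 5) == 0;
}

/* Run on every raw line before it is handed to a lexer. */
static void on_raw_line(struct reader_state *st, const char *line)
{
    if (is_from_line(line)) {
        /* End-of-message processing would happen here. */
        st->msg_count++;
        st->msg_header = true;
    } else if (st->msg_header &&
               (line[0] == '\n' || line[0] == '\r' || line[0] == '\0')) {
        /* The first empty line ends the header. */
        st->msg_header = false;
    }
}

/* What the lexer action reduces to once the reader tracks the state. */
static enum token_class classify_from(const struct reader_state *st)
{
    return st->msg_header ? FROM : TOKEN;
}
```

Feeding an mbox file through on_raw_line line by line should leave
msg_count equal to the number of "From " separator lines, which is the
kind of count comparison described above.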

ALL reported parsing bugs have been fixed.  What's lacking is feedback from
last week's bug reporters confirming that the reported problems are fixed
and saying whether any additional problems remain.

I'm only aware of one outstanding issue of significance.  Greg Louis has
reported that as the wordlists grow, Berkeley DB slows down
significantly.  He's quite concerned about this.

>I suspect that other lexers still duplicate functionality of
>lexer_head.l, which they must not.
>
>My original idea was to have one lexer (lexer_head.l) to gather the
>structure, and pass decoded stuff down to the "token extracting"
>lexers. Given that "^From " lines will never be encoded, this is
>clean.

"Pass down" sounds like an interesting design.  Can you describe it in more 
detail?

>Any rules that are aware of the message or MIME structure in
>lexer_text_{plain,html}.l are clearly misplaced under these assumptions.

Since lexer_head doesn't control the other lexers, as you envisioned, each
of them needs the above-mentioned knowledge of ^From and of mime boundaries.

>Do we have all of Matt's "interesting" messages that dug up these
>problems in bogofilter? I'd like to clean up this mess before we go
>stable, because my belly tells me that the current code is fragile.

As mentioned above, bogofilter correctly processes all of the sample
mailboxes and messages I have.

It's currently 8:00 AM in Michigan and time for me to head out for the
day.  Current plans are for another bug fix release, bogofilter 0.10.1.2,
this evening, sometime after 7:00 PM.  For those of you east of Greenwich,
that's in the wee hours of tomorrow :-)

Bye for now.

David