0.10.1.1 status report

Mon Jan 27 15:03:58 CET 2003

David Relson <relson at osagesoftware.com> writes:

> Sorry, but I disagree.  The lexers _do_ separate functions, as far as
> they can.
>
> There _are_ a few constructs common to all of them.  lexer_head.l and
> only lexer_head.l recognizes MIME-Version, Content-*, Date:.*,
> boundary=, etc.  All three of them recognize mime-boundaries,
> ip-addrs, empty lines, and ^From.  The rules for mime-boundaries and
> ^From are necessary because they signal the end of a message body
> section.

This needs to happen earlier. Non-body parts should not be passed to the
text/html or text/plain lexers in the first place.

> Yesterday I released a patch to recognize ^From based on lowest level,
> raw text and execute the end-of-message processing.

I've seen that code, but it is heading in the "ad-hoc fix" direction,
which will get in our way when maintaining the code.

> I'm only aware of one outstanding issue of significance.  Greg Louis has
> reported that as the wordlists grow, BerkeleyDB slows down significantly.
> He's quite concerned about this.

Well, what version does he use and what size is his database? The
db_stat output might be helpful to look at.

> "Pass down" sounds like an interesting design.  Can you describe it in
> more detail?

I don't have the full concept yet, but the idea is a layered
approach:

1. The first stage only knows enough to tell headers from bodies and one
   mail from the next, and to extract the necessary MIME information:
   MIME-Version (i.e. "is this MIME at all"), Content-Type,
   Content-Transfer-Encoding and other header information (charset).

   It also has exclusive "rights" to decode base64 and quoted-printable.

   It takes care of dispatching the bodies to the right text_* lexer.

2. The second stage is called by the first stage if and only if the
   first stage is in "body" state and the content type is "text/*". It
   gets its input from the first stage exclusively. There is no need to
   detect headers and bodies, just "end of input".
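
To make the two stages concrete, here is a minimal sketch in Python
(not bogofilter's C/flex code). All names (stage_one, tokenize_plain,
tokenize_html, BODY_LEXERS) are made up for illustration, and it only
handles a single-part message without folded header lines; multipart
boundaries and mbox "From " separators are left out.

```python
import base64
import quopri
import re

def tokenize_plain(text):
    # Hypothetical stand-in for the text/plain lexer: it only ever
    # sees decoded body text, never headers or MIME structure.
    return re.findall(r"[A-Za-z0-9]+", text)

def tokenize_html(text):
    # Hypothetical stand-in for the text/html lexer: drop tags first.
    return tokenize_plain(re.sub(r"<[^>]*>", " ", text))

BODY_LEXERS = {"text/plain": tokenize_plain, "text/html": tokenize_html}

def decode_body(raw, encoding):
    # Stage one is the only place that undoes transfer encodings.
    if encoding == "base64":
        return base64.b64decode(raw)
    if encoding == "quoted-printable":
        return quopri.decodestring(raw)
    return raw

def stage_one(message):
    """Split one single-part message (bytes) and dispatch its body."""
    head, _, body = message.partition(b"\n\n")
    headers = {}
    for line in head.split(b"\n"):
        name, sep, value = line.partition(b":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    ctype = headers.get(b"content-type", b"text/plain")
    ctype = ctype.split(b";")[0].strip().lower().decode("ascii")
    encoding = headers.get(b"content-transfer-encoding", b"7bit")
    encoding = encoding.lower().decode("ascii")
    lexer = BODY_LEXERS.get(ctype)
    if lexer is None:
        return []  # non-text body: never handed to a text lexer
    return lexer(decode_body(body, encoding).decode("latin-1"))
```

The point of the sketch is the division of labour: the text lexers
contain no header, boundary or decoding logic at all, so adding a new
body type would only mean adding an entry to the dispatch table.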

What is missing in this picture:

The lexer has a "pull" approach (by means of yyinput), but for stage
one, a "push" approach would simplify the dispatching considerably. I
don't yet have a good idea of how to attack this.
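
To illustrate the mismatch only (this is not a solution proposed in
this thread), here is a small Python sketch of a buffer that stage one
pushes into and a pull-style lexer reads from. BodyFeed, push, close,
read and drain are hypothetical names; read merely plays the role of a
yyinput-style callback.

```python
from collections import deque

class BodyFeed:
    """Hypothetical adapter: stage one pushes, the lexer pulls."""

    def __init__(self):
        self._chunks = deque()
        self._closed = False

    def push(self, data):
        # Called by stage one for each decoded piece of a body.
        if self._closed:
            raise ValueError("push after close")
        if data:
            self._chunks.append(data)

    def close(self):
        # Stage one has seen the end of this body section.
        self._closed = True

    def read(self, max_size):
        # Called by the lexer. Returns at most max_size bytes;
        # b"" means "end of input".
        if not self._chunks:
            if not self._closed:
                raise RuntimeError("lexer pulled before stage one was done")
            return b""
        chunk = self._chunks.popleft()
        if len(chunk) > max_size:
            # Keep the unread tail for the next call.
            self._chunks.appendleft(chunk[max_size:])
            chunk = chunk[:max_size]
        return chunk

def drain(feed, max_size=8):
    # Stand-in for the lexer's read loop.
    out = []
    while True:
        data = feed.read(max_size)
        if not data:
            return b"".join(out)
        out.append(data)
```

The RuntimeError marks exactly the open problem: a pull-style lexer
cannot simply pause in the middle of its input, so this sketch only
works if stage one buffers a complete body section before the lexer
runs.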

>>Any rules that are aware of the message or MIME structure in
>>lexer_text_{plain,html}.l are clearly misplaced under these assumptions.
>
> Since lexer_head doesn't control the others, as you envisioned, the
> above mentioned knowledge of ^From and of mime boundaries is needed.

Yup, IMHO it must go. One functionality, one function. As more code
is added (who knows if we plug gocr into bogofilter by summer 2004),
any other approach (duplicated checks) will make the code
unmaintainable. Let's keep this clean in its early stages; it will pay
off in the long run.

> It's currently 8:00 AM in Michigan and time for me to head out for the
> day.  Current plans are for another bug fix release, bogofilter 0.10.1.2
> this evening, sometime after 7:00 PM.  For those of you east of
> Greenwich, that's in the wee hours of tomorrow :-)

Central European Time is 6 hours ahead of Michigan, except for the week
between the last Sunday in March and the first Sunday in April.
http://webexhibits.org/daylightsaving/eu.html :-)

Oh, and I figured MI is EST, WI is CST.  http://www.worldtimezone.com/

HAND,

-- 
Matthias Andree