front end

David Relson relson at osagesoftware.com
Mon Aug 11 14:55:04 CEST 2003


At 01:38 PM 8/1/03, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> > The three input formats - mailbox, maildir, and message count - are all
> > useful, each in its own way.  Output formatting, while not absolutely
> > necessary, helps make the results useful for the many environments the
> > program is used in.
>
>I don't dispute the necessity of reading several input formats, what I
>dislike is that the format is buried deep inside the lexer.
>
>What if we had modules like mbox.c maildir.c bulkreader.c that could
>each split their respective input into individual mails, call the
>"libbogofilter.so" functions to do the lexer and evaluation work?

Matthias,

It's been 10 days since you made the above suggestion and I thought "Well, 
maybe, someday...".  Recent events make me think that someday is soon :-)

Michael O'Reilly recently pointed out that in a mailbox (mbox file), each
message starts with a line matching "^From\ " that follows an empty line
(newline only).  Thus the proper lexer pattern is "\n\nFrom\ ", not
"^From\ " (as we currently have).  I made the change to see what would
break (if anything).  No surprise, something _does_ break.  Here's the
scenario:

Consider a MIME multipart message with a base64 encoded part.  What
bogofilter currently does is process the MIME part header up to the empty
line (which ends the header) and then call base64_decode() on the
following lines.  This decoding is done inside of yyinput(), so the lexer
sees the decoded line.  The process works fine.

Now consider what happens with pattern "\n\nFrom\ " in the lexer.  When the
lexer sees an empty line it reads the next line (to try to match
"From\ ").  With the MIME multipart message (used above), that read gets
the first base64 encoded line of the MIME part body.  The lexer decides
that the pattern wasn't matched and goes on to other patterns.  However, it
now has the encoded line (not the decoded line), so the tokenizing uses the
wrong text and the results are garbage.
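
To make the failure concrete, here is a toy model in Python.  It is only
a sketch and not bogofilter's code: Reader stands in for yyinput(),
tokenize() stands in for the lexer, and both names (along with the
lookahead flag) are invented for this illustration.  The assumption is
that decoding gets switched on only when the lexer acts on the empty line
that ends the part header.

```python
import base64


class Reader:
    """Stand-in for yyinput(): hands out lines, decoding when told to."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self.decode_base64 = False

    def next_line(self):
        line = next(self._lines, None)
        # Only non-empty lines are decoded, and only once decoding is on.
        if line and self.decode_base64:
            return base64.b64decode(line).decode("ascii")
        return line


def tokenize(lines, lookahead):
    """Return the lines in the form the (toy) lexer gets to see them."""
    reader = Reader(lines)
    seen = []
    base64_part = False
    line = reader.next_line()
    while line is not None:
        lowered = line.lower()
        if lowered.startswith("content-transfer-encoding:") and "base64" in lowered:
            base64_part = True
        if line == "" and lookahead:
            # A "\n\nFrom " style pattern needs the next line before it can
            # decide, so the read happens before decoding is switched on.
            peeked = reader.next_line()
            reader.decode_base64 = base64_part
            seen.append(line)
            line = peeked
            continue
        if line == "":
            # End of the part header: decode the lines that follow.
            reader.decode_base64 = base64_part
        seen.append(line)
        line = reader.next_line()
    return seen
```

Feeding it a base64 header line, an empty line, and one encoded body line
gives the decoded text when lookahead is False, and the raw encoded line
when lookahead is True.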

Right now, the solution seems to be a front end that breaks the input into
individual messages and then passes each message on to parsing,
registration, classification, etc.  Stated differently, bogofilter needs a
formail-type capability.
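
A rough sketch of such a front end for the mbox case, again in Python and
again only an illustration: split_mbox() and _close() are made-up names,
and the rule used is the one described above (a "From " line counts as a
separator only at the start of the file or right after an empty line).

```python
def _close(current, messages):
    """Store the message collected so far, if there is one."""
    # The empty line in front of the next "From " line belongs to the
    # separator, so it is dropped from the end of the message.
    if current and current[-1] == "":
        current = current[:-1]
    if current:
        messages.append("\n".join(current))


def split_mbox(text):
    """Split mbox text into a list of individual messages."""
    messages = []
    current = []
    prev_blank = True  # the start of the file counts as a boundary
    for line in text.splitlines():
        if line.startswith("From ") and prev_blank:
            _close(current, messages)
            current = [line]
        else:
            current.append(line)
        prev_blank = (line == "")
    _close(current, messages)
    return messages
```

Each returned message could then be handed to the lexer on its own, so
the lexer never has to look across a message boundary.  Note that a body
line starting with "From " directly after an empty line would still be
taken as a separator; this sketch does nothing about that.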

David




