front end
David Relson
relson at osagesoftware.com
Mon Aug 11 14:55:04 CEST 2003
At 01:38 PM 8/1/03, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> > The three input formats - mailbox, maildir, and message count - are all
> > useful, each in its own way. Output formatting, while not absolutely
> > necessary, helps make the results useful for the many environments the
> > program is used in.
>
>I don't dispute the necessity of reading several input formats, what I
>dislike is that the format is buried deep inside the lexer.
>
>What if we had modules like mbox.c maildir.c bulkreader.c that could
>each split their respective input into individual mails, call the
>"libbogofilter.so" functions to do the lexer and evaluation work?
Matthias,

It's been 10 days since you made the above suggestion and I thought "Well,
maybe, someday...". Recent events make me think that someday is soon :-)

Michael O'Reilly recently pointed out that in a mailbox (mbox file),
messages are separated by "^From\ " after an empty line (newline
only). Thus the proper lexer pattern is "\n\nFrom\ ", not "^From\ " (as we
currently have). I made the change to see what would break (if
anything). No surprise, something _does_ break. Here's the scenario:

Consider a MIME multipart message with a base64 encoded part. What
bogofilter currently does is process the MIME part header up to the empty
line (which ends the header) and then call base64_decode() for the
following lines. This decoding is done inside of yyinput(), so the lexer
sees the decoded lines. The process works fine.

Now consider what happens with pattern "\n\nFrom\ " in the lexer. When the
lexer sees an empty line, it reads the next line (to try to match
"From\ "). With the MIME multipart message described above, that read gets
the first base64 encoded line of the MIME part body. The lexer decides that
the pattern wasn't matched and goes on to other patterns. However, it now
holds the encoded line (not the decoded line), so the tokenizing uses the
wrong text and the results are garbage.

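The read-ahead problem described above can be reproduced with a toy model.
This is only a sketch in Python, not bogofilter's C code: the Reader class,
its decode flag, and tokenize() are hypothetical names invented for the
illustration. It also assumes that decoding is switched on only once the
empty line ending the part header has been processed, which is the
behaviour the scenario implies.

```python
import base64


class Reader:
    """Hypothetical stand-in for the input layer: decodes only when told to."""

    def __init__(self, lines):
        self.lines = iter(lines)
        self.decode = False  # switched on once the part header has ended

    def next_line(self):
        line = next(self.lines, None)
        if line and self.decode:
            return base64.b64decode(line).decode("ascii")
        return line


def tokenize(lines, lookahead):
    """Return the lines as the lexer would see them.

    With lookahead=True the line after an empty line is fetched before
    decoding is enabled, as a "\\n\\nFrom " pattern would force.
    """
    reader = Reader(lines)
    seen = []
    line = reader.next_line()
    while line is not None:
        if line == "":
            seen.append(line)
            if lookahead:
                prefetched = reader.next_line()  # still raw: decode is off
                reader.decode = True
                line = prefetched
            else:
                reader.decode = True
                line = reader.next_line()  # decoded as intended
            continue
        seen.append(line)
        line = reader.next_line()
    return seen


part = ["Content-Transfer-Encoding: base64", "", "SGVsbG8=", "d29ybGQ="]
print(tokenize(part, lookahead=False))
print(tokenize(part, lookahead=True))
```

In this model only the first body line leaks through undecoded when
lookahead is on; that is enough to hand the tokenizer the wrong text.
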
Right now, the solution seems to be a front end that breaks the input into
messages and then passes each message on for parsing, registration,
classification, etc. Stated differently, bogofilter needs a formail-type
capability.

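A minimal sketch of such a front end, in Python for illustration only (it
is not bogofilter code, and split_mbox() is a hypothetical name): it splits
mbox text on a "From " line that follows an empty line, so each message
could be handed to the lexer separately. It assumes the input begins with a
"From " line and that body lines starting with "From " after an empty line
have already been quoted by whatever wrote the mailbox.

```python
def split_mbox(text):
    """Split mbox text into messages.

    A line starting with "From " begins a new message only when the
    previous line was empty (or when it is the first line of the input).
    """
    messages = []
    current = []
    prev_blank = True  # the start of the input counts as "after an empty line"
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") and prev_blank and current:
            messages.append("".join(current))
            current = []
        current.append(line)
        prev_blank = line in ("\n", "\r\n")
    if current:
        messages.append("".join(current))
    return messages


sample = (
    "From a@example.com Mon Aug 11 14:55:04 2003\n"
    "Subject: one\n"
    "\n"
    "body text\n"
    "From the middle of a paragraph, so not a separator\n"
    "\n"
    "From b@example.com Mon Aug 11 14:56:00 2003\n"
    "Subject: two\n"
    "\n"
    "body\n"
)
for number, message in enumerate(split_mbox(sample), start=1):
    print(number, message.splitlines()[0])
```

Because the split happens before any MIME handling, the lexer would never
need to look past an empty line to find a message boundary.
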
David