the recent ^From issues.

Matt Armstrong matt at lickey.com
Mon Jan 27 18:26:53 CET 2003


Matthias Andree <matthias.andree at gmx.de> writes:

> I have been unable to track bogofilter development last week, but I
> gather as much:
>
> * The "From " line is not only detected in the "header" lexer that is
>   to figure the structure, but also in the text/plain and text/html
>   lexer.
>
> My original idea about splitting the lexers was to separate functions,
> and it seems the current implementation misses the point.
>
> I suspect that other lexers still duplicate functionality of
> lexer_head.l, which they must not.
>
> My original idea was to have one lexer (lexer_head.l) to gather the
> structure, and pass decoded stuff down to the "token extracting"
> lexers. Given that "^From " lines will never be encoded, this is
> clean.
>
> Any rules that are aware of the message or MIME structure in
> lexer_text_{plain,html}.l are clearly misplaced under these assumptions.
>
> Do we have all of Matt's "interesting" messages that dug up these
> problems in bogofilter? I'd like to clean up this mess before we go
> stable, because my belly tells me that the current code is fragile.

Everything I've given Dave is at http://www.lickey.com/~matt/bogo/

But the only problem related to mbox parsing is the qp-encoded
"^From " line.

Basically, the problem boils down to three points:

    - Bogofilter should not treat "^From " specially at all unless it
      knows it is parsing a unix mbox file (because such a line has no
      special meaning outside of a unix mbox file).

      This will prevent bogofilter from treating a single message as
      multiple messages.

      Bogofilter can tell that it is parsing a unix mbox file when
      the first line of the input matches "^From ".

      (it looks like current CVS does this)

    - When parsing an mbox file, the entire "^From " line should be
      thrown away and not used as a source of tokens.

      Rationale: It is not part of the message.  If I split my mbox
      into a Maildir I should get the same set of tokens when I train
      bogofilter with it.

      (current CVS does not do this)

    - "^From " checking should happen before any MIME decoding
      happens on the input stream.

      Note: the qp-encoded "From " (written "=46rom " in the raw
      message) is a trick mailers can use to preserve the content of
      their mail when they know it might be stored in a unix mbox.
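
To make the three points above concrete, here is a minimal sketch in
Python. It is not bogofilter's code (bogofilter is written in C with
flex lexers); the function name split_messages is hypothetical and
only illustrates the intended behaviour: decide mbox-or-not from the
first line, throw the "From " separator lines away, and do all of
this on the raw lines before any MIME decoding.

```python
import quopri


def split_messages(lines):
    """Split raw, undecoded input lines into a list of messages.

    Hypothetical helper for illustration only, not bogofilter's API.
    Each message is returned as a list of raw byte lines.
    """
    lines = list(lines)

    # Point 1: "From " is only special if the input is a unix mbox,
    # which we assume when the very first line starts with "From ".
    is_mbox = bool(lines) and lines[0].startswith(b"From ")
    if not is_mbox:
        # A single message: a body line starting with "From " is
        # ordinary text and must not split it.
        return [lines] if lines else []

    messages = []
    for line in lines:
        if line.startswith(b"From "):
            # Point 2: the separator starts a new message and is
            # discarded, so it never becomes a source of tokens.
            messages.append([])
        else:
            messages[-1].append(line)
    return messages


# Point 3: the check runs on raw lines, so a qp-encoded "=46rom "
# body line is not mistaken for a separator. It only turns into
# "From " later, when the body is decoded.
mbox = [
    b"From a@example.com Mon Jan 27 18:26:53 2003\n",
    b"Subject: one\n",
    b"\n",
    b"=46rom the start, this is body text\n",
    b"From b@example.com Mon Jan 27 18:27:00 2003\n",
    b"Subject: two\n",
]
parts = split_messages(mbox)
decoded_body_line = quopri.decodestring(parts[0][2])
```

With this ordering, the same message yields the same lines whether it
comes from an mbox or from a Maildir file, which is the property
asked for in the second point.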



