the recent ^From issues.
Matt Armstrong
matt at lickey.com
Mon Jan 27 18:26:53 CET 2003
Matthias Andree <matthias.andree at gmx.de> writes:
> I have been unable to track bogofilter development last week, but I
> gather as much:
>
> * The "From " line is not only detected in the "header" lexer that is
> meant to figure out the structure, but also in the text/plain and
> text/html lexers.
>
> My original idea about splitting the lexers was to separate functions,
> and it seems the current implementation misses the point.
>
> I suspect that other lexers still duplicate functionality of
> lexer_head.l, which they must not.
>
> My original idea was to have one lexer (lexer_head.l) to gather the
> structure, and pass decoded stuff down to the "token extracting"
> lexers. Given that "^From " lines will never be encoded, this is
> clean.
>
> Any rules that are aware of the message or MIME structure in
> lexer_text_{plain,html}.l are clearly misplaced under these assumptions.
>
> Do we have all of Matt's "interesting" messages that dug up these
> problems in bogofilter? I'd like to clean up this mess before we go
> stable, because my belly tells me that the current code is fragile.

Everything I've given Dave is at http://www.lickey.com/~matt/bogo/

But the only problem related to mbox parsing is the qp-encoded
"^From " line.

Basically, the problem can be boiled down to:

- Bogofilter should not treat "^From " specially at all unless it
  knows it is parsing a Unix mbox file (because such a line has no
  special meaning outside of a Unix mbox file).

  This will prevent bogofilter from treating a single message as
  multiple messages.

  Bogofilter can tell that it is parsing a Unix mbox file when the
  first line of the input matches "^From ".

  (it looks like current CVS does this)

- When parsing an mbox file, the entire "^From " line should be
  thrown away and not used as a source of tokens.

  Rationale: It is not part of the message. If I split my mbox
  into a Maildir I should get the same set of tokens when I train
  bogofilter with it.

  (current CVS does not do this)

- "^From " checking should happen before any MIME decoding
  happens on the input stream.

  Note: the qp-encoded "From " is a trick mailers can use to
  preserve the content of their mail when they know it might be
  stored in a Unix mbox.
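
The three points above can be sketched roughly as follows. This is an
illustration in Python, not bogofilter's actual code (bogofilter's lexers
are written in flex/C); the names split_messages and decode_qp are
hypothetical, and decoding a whole message as quoted-printable is a
simplification that ignores headers and MIME structure.

```python
import quopri


def split_messages(raw_lines):
    """Split raw input lines (bytes) into a list of messages.

    Hypothetical helper, not a bogofilter function.
    """
    lines = list(raw_lines)
    if not lines:
        return []

    # Point 1: "From " is special only when the input is an mbox,
    # which we assume is the case iff the first line starts with it.
    if not lines[0].startswith(b"From "):
        return [lines]

    messages = []
    for line in lines:
        # Point 3: the test runs on the raw line, before any MIME decoding.
        if line.startswith(b"From "):
            # Point 2: the separator line is dropped, so it never
            # becomes a source of tokens.
            messages.append([])
        else:
            messages[-1].append(line)
    return messages


def decode_qp(message_lines):
    """Quoted-printable decode applied only AFTER splitting (simplified)."""
    return quopri.decodestring(b"".join(message_lines))
```

With this ordering, a body line that a mailer wrote as "=46rom here on"
stays inside its message during splitting and only turns back into
"From here on" when decode_qp runs afterwards. A single message on
stdin whose body contains a bare "From " line is likewise left as one
message, because its first line is not an mbox separator.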