the recent ^From issues.

Herman Oosthuysen Herman at WirelessNetworksInc.com
Mon Jan 27 19:53:56 CET 2003


Uhmmm, it is my experience that many ISPs strip the "From " line from the 
mail.  Consequently, the first thing my procmail recipe has to do is rebuild 
the "From " line with a call to formail. So, relying on the "From " line as a 
message delimiter is not very robust.
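
To illustrate what I mean by rebuilding the line, here is a rough Python
sketch. It is not formail's actual algorithm; the function name
ensure_from_line, the choice of Return-Path before From:, and the
MAILER-DAEMON fallback are all assumptions made for the example.

```python
import time
from email import message_from_bytes
from email.utils import parseaddr
from typing import Optional


def ensure_from_line(raw: bytes, now: Optional[float] = None) -> bytes:
    """Prepend an mbox-style "From " line if the message lacks one.

    Hypothetical helper: the sender is guessed from Return-Path, then From:,
    which is an assumption and not a description of what formail does.
    """
    if raw.startswith(b"From "):
        # Already in mbox form, leave it untouched.
        return raw
    msg = message_from_bytes(raw)
    header = msg.get("Return-Path") or msg.get("From") or ""
    sender = parseaddr(str(header))[1] or "MAILER-DAEMON"
    # asctime-style timestamp, e.g. "Mon Jan 27 19:53:56 2003"
    stamp = time.asctime(time.gmtime(now))
    return f"From {sender} {stamp}\n".encode("ascii", "replace") + raw
```

The point is only that the delimiter line may be synthesized late in the
delivery chain, so its content says little about the message itself.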

Matt Armstrong wrote:
> Matthias Andree <matthias.andree at gmx.de> writes:
> 
> 
>>I have been unable to track bogofilter development last week, but I
>>gather as much:
>>
>>* The "From " line is not only detected in the "header" lexer that is
>>  to figure the structure, but also in the text/plain and text/html
>>  lexer.
>>
>>My original idea about splitting the lexers was to separate functions,
>>and it seems the current implementation misses the point.
>>
>>I suspect that other lexers still duplicate functionality of
>>lexer_head.l, which they must not.
>>
>>My original idea was to have one lexer (lexer_head.l) to gather the
>>structure, and pass decoded stuff down to the "token extracting"
>>lexers. Given that "^From " lines will never be encoded, this is
>>clean.
>>
>>Any rules that are aware of the message or MIME structure in
>>lexer_text_{plain,html}.l are clearly misplaced under these assumptions.
>>
>>Do we have all of Matt's "interesting" messages that dug up these
>>problems in bogofilter? I'd like to clean up this mess before we go
>>stable, because my belly tells me that the current code is fragile.
> 
> 
> Everything I've given Dave is at http://www.lickey.com/~matt/bogo/
> 
> But the only problem related to mbox parsing is the qp encoded "^From
> " line.
> 
> Basically, the problem can be boiled down to:
> 
>     - Bogofilter should not treat "^From " specially at all unless it
>       knows it is parsing a unix mbox file (Because such a line has no
>       special meaning outside of a unix mbox file).
> 
>       This will prevent bogofilter from treating a single message as
>       multiple messages.
> 
>       Bogofilter can tell if it is parsing a unix mbox file if the
>       first line of the input is "^From ".
> 
>       (it looks like current CVS does this)
> 
>     - When parsing an mbox file, the entire "^From " line should be
>       thrown away and not used as a source of tokens.
> 
>       Rationale: It is not part of the message.  If I split my mbox
>       into a Maildir I should get the same set of tokens when I train
>       bogofilter with it.
> 
>       (current CVS does not do this)
> 
>     - "^From " checking should happen before any MIME decoding
>       happens on the input stream.
> 
>       Note: the qp-encoded "From " is a trick mailers can use to
>       preserve the content of their mail when they know it might be
>       stored in a unix mbox.
> 
> 
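
Matt's three points above could be sketched roughly as follows. This is
Python for illustration only, not bogofilter's lexer code; split_mbox is a
made-up name, and the sketch ignores the finer mbox conventions (such as
requiring a blank line before each "From " line).

```python
import quopri


def split_mbox(raw: bytes) -> list:
    """Split input into messages, treating "From " as a delimiter only
    when the very first line starts with it (i.e. the input is an mbox)."""
    lines = raw.splitlines(keepends=True)
    # Point 1: outside an mbox, "From " has no special meaning.
    if not lines or not lines[0].startswith(b"From "):
        return [raw]
    messages = []
    current = None
    for line in lines:
        # Point 3: this check runs on the raw, still-encoded bytes.
        if line.startswith(b"From "):
            if current is not None:
                messages.append(b"".join(current))
            # Point 2: the delimiter line itself is thrown away.
            current = []
        else:
            current.append(line)
    messages.append(b"".join(current))
    return messages


mbox = (
    b"From a@example.com Mon Jan 27 19:53:56 2003\n"
    b"Subject: one\n\n"
    b"=46rom here on, body one\n\n"
    b"From b@example.com Mon Jan 27 19:54:00 2003\n"
    b"Subject: two\n\n"
    b"body two\n"
)
parts = split_mbox(mbox)
# The qp-encoded line only becomes "From ..." after decoding, so it
# never splits the first message.
decoded = quopri.decodestring(b"=46rom here on, body one\n")
```

Because the split happens before any decoding, the "=46rom" line stays
inside the first message, and neither delimiter line ends up in the text
handed to the tokenizer.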

-- 

Herman Oosthuysen
B.Eng.(E), Member of IEEE
Wireless Networks Inc.
http://www.WirelessNetworksInc.com
E-mail: Herman at WirelessNetworksInc.com
Phone: 1.403.569-5687, Fax: 1.403.235-3965