unfolding header lines

Thu Sep 4 02:13:09 CEST 2003

Greetings all,

As you all know, bogofilter now (as of 0.15.1) knows about unfolding
header lines.  This is useful as it allows tagging of all the tokens of
a multi-line To:, Subject:, From:, or Return-Path: header line.

The initial implementation is done by the C function get_unfolded_line()
which does a bit of pre-reading of text to identify folded lines.  The
code converts the newlines encountered to spaces.  It all works great -
until the folded line far exceeds the prescribed max line length
(RFC-2822, 998 characters).  When the input buffer gets close to full
(over 8k), the function returns and the remainder of the folded line
isn't tagged.
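
To make the failure mode concrete, here is a minimal sketch of the
unfolding idea.  It is not the actual get_unfolded_line() code; the
name unfold_line(), its signature, and the "truncated" flag are
made up for illustration.  It copies one logical header line, turns
each newline that is followed by a space or tab into a space, and
gives up when the output buffer is full, which is the point where
the rest of a very long folded line stops being tagged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT bogofilter's get_unfolded_line().
 * Copies one logical header line from `in` to `out` (capacity `cap`).
 * A newline followed by a space or tab is a fold and becomes a space.
 * A newline followed by anything else ends the logical line.
 * Sets *truncated when `out` fills up before the line ends.
 * Returns the number of input bytes consumed. */
size_t unfold_line(const char *in, char *out, size_t cap, int *truncated)
{
    size_t i = 0;   /* read position in `in`   */
    size_t o = 0;   /* write position in `out` */

    assert(cap > 0);
    *truncated = 0;

    while (in[i] != '\0') {
        char c = in[i];

        if (c == '\n') {
            if (in[i + 1] != ' ' && in[i + 1] != '\t') {
                i++;        /* consume the terminating newline */
                break;
            }
            c = ' ';        /* folded line: newline becomes a space */
        }

        if (o + 1 >= cap) { /* keep room for the trailing NUL */
            *truncated = 1;
            break;
        }

        out[o++] = c;
        i++;
    }

    out[o] = '\0';
    return i;
}
```

With a small buffer the caller sees truncated set and only part of
the folded line in out, so whatever follows would have to be handled
separately or it loses its tag.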

It has been suggested that the flex grammar, i.e. lexer_v3.l, is the
right place to handle the unfolding.  Indeed, it's quite easy to change
the grammar to recognize the folded lines.  However, there are side
effects.  First, other rules in the grammar need to be changed to allow
newlines (which are converted to spaces in the C function).  This is
easy and doesn't affect much.  Second, code must be added to the grammar
to indicate when to stop using the current tag (so only the desired
lines are tagged).  These changes are more numerous and somewhat messier
than the newline changes.  They are doable and work, except for one
problem.  At the end of every message header and MIME body part header
is an empty line.  Using the C code, the pattern for the line is
"^[ \t]*$".  When the unfolding work shifts into lexer_v3.l, the pattern
becomes "\n[ \t]*\n" and this causes trouble.  The lexer is in header
mode as it reads the empty line and as it pre-reads the line _after_
that.  Being in header mode, base64 and qp decoding don't get applied.
End of story :-(
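
For comparison, here is a rough sketch of the line-at-a-time check
that "^[ \t]*$" amounts to.  The names is_blank_line(), next_mode(),
and the mode enum are hypothetical and not taken from the bogofilter
source.  The point is only that, when each line is examined on its
own, the switch out of header mode can be decided before the
following line has been read.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum mode { MODE_HEADER, MODE_BODY };

/* Equivalent of matching "^[ \t]*$" against one line whose
 * trailing newline has already been stripped. */
bool is_blank_line(const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;
    return *line == '\0';
}

/* Returns the mode in which the NEXT line should be read.
 * A blank line seen in header mode ends the header. */
enum mode next_mode(enum mode current, const char *line)
{
    if (current == MODE_HEADER && is_blank_line(line))
        return MODE_BODY;
    return current;
}
```

The "\n[ \t]*\n" pattern cannot be checked this way, since matching
it means the lexer has already moved on past the empty line while
still in header mode.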

If any of you are interested in trying to solve this problem, I can
provide a patch and a test message that fails as described.

David