[cvs] bogofilter/src lexer_v3.l,1.158,1.159

David Relson relson at osagesoftware.com
Sun Jun 26 18:10:15 CEST 2005


On Sun, 26 Jun 2005 10:50:41 +0200
Matthias Andree wrote:

> David Relson <relson at users.sourceforge.net> writes:

....[snip]...

> There can be more than one encoded word on the same line, so this skips
> the 2nd and all subsequent words. I have seen these in the wild in
> solicited mail, and such behavior is actually recommended by the
> standards, i.e. if I have a longish Subject with, say, six words with
> umlauts in the first and last word, the recommended encoding is to
> encode the first and last word and leave the four words in between
> unencoded and outside the "encoded-word" syntax elements.

Indeed, there can be multiple encoded words per line.  As if that didn't
complicate matters enough, each header may also be folded onto
continuation lines.  Also, the white space between adjacent encoded
words is ignored when decoding.  All of these factors make the
text_decode function quite complicated.
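
To make that concrete, here is a made-up example of a legal Subject
header: two encoded words, folded onto a continuation line, with the
intervening white space discarded when decoding:

    Subject: =?ISO-8859-1?Q?Gr=FC=DFe_aus_M=FCn?=
     =?ISO-8859-1?Q?chen?=

This decodes to "Grüße aus München"; text_decode has to stitch the two
pieces together across the fold without keeping the space between them.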

Without unicode conversion, decoding an encoded word is guaranteed to
produce a shorter line, because the charset label and the rest of the
encoded-word overhead go away.  With unicode conversion, the resultant
line can be longer than the original.  Such complications make the code
even more difficult.
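
A contrived illustration of the second case: a base64 encoded word
carrying twelve euro signs (0x80 in windows-1252),

    =?windows-1252?B?gICAgICAgICAgICA?=

is 35 characters on the wire.  Stripping the wrapper and undoing the
base64 leaves 12 bytes, but converting those to UTF-8 (3 bytes per euro
sign) yields 36 bytes -- longer than what we started with -- so the code
can't simply decode in place and assume the result fits.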

> We'll need to find an approach that either tracks the position with
> character accuracy or, preferably, one that works without yy_unput,
> probably by moving RFC-2047 decoding out of the lexer into the MIME
> decoder, close to header unfolding.

Processing RFC2047 encoded words without yy_unput and without recursion
is tricky.  The lexer specification is what recognizes the encoded word
in the first place; after decoding, the resultant text must itself be
tokenized -- another task for the lexer.  Handling either task without
using lexer_v3.l adds complexity to the parsing.
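
For what it's worth, one way to re-scan the decoded text inside
lexer_v3.l without yy_unput is flex's buffer-switching API, along the
lines of the include-file example in the flex manual.  A rough,
untested sketch -- rfc2047_decode() and {ENCODED_WORD} are made-up
stand-ins for text_decode and whatever pattern we actually use, and the
real lexer's own EOF handling would have to be merged in:

%{
/* buffer to resume once the decoded text has been tokenized */
static YY_BUFFER_STATE saved_buf = NULL;
%}

%%

{ENCODED_WORD}  {
        size_t len;
        char *decoded = rfc2047_decode(yytext, (size_t)yyleng, &len);
        if (decoded != NULL) {
            saved_buf = YY_CURRENT_BUFFER;      /* remember where we were */
            yy_scan_bytes(decoded, (int)len);   /* switch to decoded text */
            free(decoded);                      /* yy_scan_bytes() copies */
        }
    }

<<EOF>>         {
        if (saved_buf == NULL)
            yyterminate();                      /* real end of input */
        yy_delete_buffer(YY_CURRENT_BUFFER);    /* drop temporary buffer */
        yy_switch_to_buffer(saved_buf);         /* resume original input */
        saved_buf = NULL;
    }

The tokens from the decoded text would then come out of the normal
yylex() loop, so the rest of bogofilter need not know the difference.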

Permit me to play devil's advocate here.  Does it matter if the header
lines are processed more than once for RFC2047 tokens?  Is there any
measurable effect on bogofilter's classification abilities?  If the
answer is no, perhaps we should document this as an unimportant
limitation.  What think you?

David



