further text_decode() issues

Sat Oct 9 04:42:06 CEST 2004

I have some more issues with text_decode:

1. Why are we decoding RFC-2047 recursively? That doesn't seem right to me.
   (Cause: we're pushing the RFC-2047-decoded stuff back into our input
   queue without blocking RFC-2047 recursion.)

   Proof (three nested layers of Q encoding around the word "test"):
   echo 'header: =?US-ASCII?Q?=3D=3FUS-ASCII=3FQ=3F=3D3D=3D3FUS-ASCII=3D3FQ=3D3Ftest=3D3F=3D3D=3F=3D?=' | bogolexer

   Yes, I know the encoded word isn't RFC-2047 conformant because of its
   length (it exceeds the 75-character limit), but bogofilter doesn't care.

2. The parser is apparently not robust: it assumes that the encoded
   word is well-formed. What if it isn't? I haven't managed to break it
   since the fix, but I haven't tried for long. Maybe lexer_v3 helps to
   avoid the bugs the code still has.

   If someone can come up with a test case that breaks bogolexer's
   RFC-2047 decoder in bogofilter's current CVS version (trunk version,
   not txn branch!), please let me know.

3. We should probably unfold the input header lines before handing them
   to the lexer. We could then also remove the RFC-2047 rules from
   lexer_v3.l. (I cannot do this before Sunday.)
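
To illustrate item 1: a minimal sketch of a single-pass Q decoder. This
is not bogofilter's text_decode(); the names decode_q_once and hex_val
are made up for the example, and only the Q encoding is handled. The
point is that decoded bytes go straight to the output buffer and are
never scanned again, so an encoded word hidden inside another encoded
word stays as plain text.

```c
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper, not bogofilter code: decode each Q-encoded word
 * "=?charset?Q?text?=" found in `in` exactly once, writing to `out`.
 * The decoded bytes are never fed back into the scanner.
 * Returns the output length, or (size_t)-1 if `out` is too small.
 */
size_t decode_q_once(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    if (outsz == 0)
        return (size_t)-1;

    while (*in != '\0') {
        const char *q = NULL, *text = NULL, *end = NULL;

        if (in[0] == '=' && in[1] == '?')
            q = strchr(in + 2, '?');          /* '?' that ends the charset */
        if (q != NULL && (q[1] == 'Q' || q[1] == 'q') && q[2] == '?') {
            text = q + 3;
            end = strstr(text, "?=");         /* terminator of this word */
        }

        if (end == NULL) {                    /* no encoded word here: copy one byte */
            if (n + 1 >= outsz)
                return (size_t)-1;
            out[n++] = *in++;
            continue;
        }

        for (; text < end; text++) {
            int c = (unsigned char)*text;

            if (c == '_') {
                c = ' ';                      /* Q encoding: '_' stands for space */
            } else if (c == '=' && end - text >= 3
                       && hex_val((unsigned char)text[1]) >= 0
                       && hex_val((unsigned char)text[2]) >= 0) {
                c = hex_val((unsigned char)text[1]) * 16
                  + hex_val((unsigned char)text[2]);
                text += 2;                    /* skip the two hex digits */
            }
            if (n + 1 >= outsz)
                return (size_t)-1;
            out[n++] = (char)c;
        }
        in = end + 2;                         /* resume after "?=", never inside out */
    }
    out[n] = '\0';
    return n;
}
```

Fed the header from the proof in item 1, one call to this sketch strips
only the outermost layer; the inner "=?US-ASCII?Q?...?=" comes out as
literal text instead of being decoded again.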
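
To illustrate item 2: a sketch of a strict well-formedness check that
could run before any decoding is attempted. It follows the RFC-2047
grammar ("=?" charset "?" encoding "?" encoded-text "?=") but, in line
with the remark in item 1, deliberately does not enforce the
75-character limit. is_encoded_word and is_token_char are names
invented for this example, not functions from the bogofilter tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* RFC 2047 token character: printable ASCII except SPACE and especials. */
static bool is_token_char(unsigned char c)
{
    return c > 32 && c < 127 && strchr("()<>@,;:\\\"/[]?.=", c) == NULL;
}

/*
 * Hypothetical checker: true only if s[0..len) is exactly one
 * well-formed encoded word using the B or Q encoding.
 */
bool is_encoded_word(const char *s, size_t len)
{
    size_t i = 2, start;

    if (len < 2 || s[0] != '=' || s[1] != '?')
        return false;

    /* charset: one or more token characters, then '?' */
    start = i;
    while (i < len && is_token_char((unsigned char)s[i]))
        i++;
    if (i == start || i >= len || s[i] != '?')
        return false;
    i++;

    /* encoding: a single B or Q (either case), then '?' */
    if (i >= len || !(s[i] == 'B' || s[i] == 'b' || s[i] == 'Q' || s[i] == 'q'))
        return false;
    i++;
    if (i >= len || s[i] != '?')
        return false;
    i++;

    /* encoded-text: one or more printable ASCII chars except '?' and SPACE */
    start = i;
    while (i < len && (unsigned char)s[i] > 32 && (unsigned char)s[i] < 127
           && s[i] != '?')
        i++;
    if (i == start)
        return false;

    /* the word must end with "?=" and nothing may follow */
    return i + 2 == len && s[i] == '?' && s[i + 1] == '=';
}
```

A decoder that calls such a check first can pass anything that fails it
through untouched, rather than trusting the input to be well-formed.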
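
To illustrate item 3: a sketch of header unfolding done before the
lexer sees the text. It assumes the usual mail convention that a line
break followed by a space or tab continues the previous header line;
the line break is dropped and the whitespace is kept. unfold_header is
a hypothetical name, and the function works in place on a
NUL-terminated buffer, which is only one of several possible designs.

```c
#include <stddef.h>

/*
 * Hypothetical helper: unfold header lines in place.
 * A CRLF or bare LF that is immediately followed by SPACE or TAB is
 * removed; all other bytes are kept. Returns the new string length.
 */
size_t unfold_header(char *s)
{
    size_t r = 0, w = 0;

    while (s[r] != '\0') {
        size_t eol = 0;

        if (s[r] == '\r' && s[r + 1] == '\n')
            eol = 2;                          /* CRLF line ending */
        else if (s[r] == '\n')
            eol = 1;                          /* bare LF line ending */

        if (eol != 0 && (s[r + eol] == ' ' || s[r + eol] == '\t')) {
            r += eol;                         /* continuation: drop the line break */
            continue;
        }
        s[w++] = s[r++];                      /* ordinary byte, or a real line end */
    }
    s[w] = '\0';
    return w;
}
```

With the folds removed up front, each header arrives at the lexer as a
single line.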

-- 
Matthias Andree

Encrypted mail welcome: my GnuPG key ID is 0x052E7D95 (PGP/MIME preferred)