the RFC2047 problem

Thu Dec 2 00:13:15 CET 2004

David Relson <relson at osagesoftware.com> writes:

> lexer repeat counts are very, very bad.  They make it very large.
>
> Before:
>
> without {0,1000}:
> -rw-r--r--  1 ... 124029 Nov 30 13:15 lexer_v3.c
> -rw-r--r--  1 ...  12802 Nov 30 13:14 lexer_v3.l
> -rw-r--r--  1 ...  74796 Nov 30 13:15 lexer_v3.o
>
> with {0,1000}:
> -rw-r--r--  1 ... 2687019 Nov 30 20:41 lexer_v3.c
> -rw-r--r--  1 ...   12809 Nov 30 20:41 lexer_v3.l
> -rw-r--r--  1 ... 1647532 Nov 30 20:41 lexer_v3.o

With only \n added to the [^...] class, and without the {0,1000}
repeat count (flex 2.5.4 here):

-rw-r--r--  1   69660 2004-12-02 00:09 build-db42/src/lexer_v3.o
-rw-r--r--  1  113326 2004-12-02 00:09 src/lexer_v3.c

   text    data     bss     dec     hex filename
  47919       8      64   47991    bb77 build-db42/src/lexer_v3.o

Corresponding diff:

Index: src/lexer_v3.l
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/lexer_v3.l,v
retrieving revision 1.154
diff -u -r1.154 lexer_v3.l
--- src/lexer_v3.l	23 Nov 2004 04:28:02 -0000	1.154
+++ src/lexer_v3.l	1 Dec 2004 23:11:12 -0000
@@ -185,7 +185,7 @@
 HTML_ENCODING	"&#"x?[[:xdigit:]]+";"
 URL_ENCODING	"%"[[:xdigit:]][[:xdigit:]]
 
-ENCODED_WORD	=\?{CHARSET}\?[bq]\?[^?]*\?=
+ENCODED_WORD	=\?{CHARSET}\?[bq]\?[^?\n]*\?=
 ENCODED_TOKEN	({TOKENFRONT}{TOKENMID})?({ENCODED_WORD}{WHITESPACE}+)*{ENCODED_WORD}
 
 /*

Given that an encoded word must not be continued on the next line
(RFC 2047 does not allow whitespace inside an encoded-word), this is
safe.
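
To illustrate the effect of the one-character change outside of flex,
here is a rough sketch using Python's re module. The CHARSET pattern
and the sample header lines are made up for the example and are not
taken from lexer_v3.l; only the [^?] versus [^?\n] difference mirrors
the diff above. In both flex and Python, a negated class such as [^?]
matches a newline unless \n is listed in it.

```python
import re

# Assumption: a simplified stand-in for {CHARSET}, not the real definition.
CHARSET = r"[A-Za-z0-9_.:-]+"

# Old and new ENCODED_WORD patterns, differing only in the negated class.
OLD = re.compile(r"=\?" + CHARSET + r"\?[bq]\?[^?]*\?=")
NEW = re.compile(r"=\?" + CHARSET + r"\?[bq]\?[^?\n]*\?=")

# A well-formed encoded word that sits on a single line.
good = "Subject: =?iso-8859-1?q?caf=E9?= menu"

# A broken encoded word with no closing "?=" on its own line, followed
# by an unrelated line that happens to contain "?=".
broken = "Subject: =?iso-8859-1?q?unterminated\nX-Other: foo?= bar"


def matched(pattern, text):
    """Return the text matched by pattern, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for name, pattern in (("old", OLD), ("new", NEW)):
        print(name, "good:  ", repr(matched(pattern, good)))
        print(name, "broken:", repr(matched(pattern, broken)))
```

Both patterns accept the well-formed word. Only the old one runs past
the end of the line on the broken input and swallows part of the next
header; the new one finds no match there.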

Is this OK to commit for the nonce?
We can still switch to another fix later.

-- 
Matthias Andree