the RFC2047 problem
Matthias Andree
matthias.andree at gmx.de
Thu Dec 2 00:13:15 CET 2004
David Relson <relson at osagesoftware.com> writes:
> lexer repeat counts are very, very bad. They make it very large.
>
> Before:
>
> without {0,1000}:
> -rw-r--r-- 1 ... 124029 Nov 30 13:15 lexer_v3.c
> -rw-r--r-- 1 ... 12802 Nov 30 13:14 lexer_v3.l
> -rw-r--r-- 1 ... 74796 Nov 30 13:15 lexer_v3.o
>
> with {0,1000}:
> -rw-r--r-- 1 ... 2687019 Nov 30 20:41 lexer_v3.c
> -rw-r--r-- 1 ... 12809 Nov 30 20:41 lexer_v3.l
> -rw-r--r-- 1 ... 1647532 Nov 30 20:41 lexer_v3.o
Just inserting \n into the [^...] character class, but without the
{0,1000} repeat count (flex 2.5.4 here), gives:
-rw-r--r-- 1 69660 2004-12-02 00:09 build-db42/src/lexer_v3.o
-rw-r--r-- 1 113326 2004-12-02 00:09 src/lexer_v3.c
   text    data     bss     dec     hex filename
  47919       8      64   47991    bb77 build-db42/src/lexer_v3.o
Corresponding diff:
Index: src/lexer_v3.l
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/lexer_v3.l,v
retrieving revision 1.154
diff -u -r1.154 lexer_v3.l
--- src/lexer_v3.l 23 Nov 2004 04:28:02 -0000 1.154
+++ src/lexer_v3.l 1 Dec 2004 23:11:12 -0000
@@ -185,7 +185,7 @@
 HTML_ENCODING "&#"x?[[:xdigit:]]+";"
 URL_ENCODING "%"[[:xdigit:]][[:xdigit:]]
-ENCODED_WORD =\?{CHARSET}\?[bq]\?[^?]*\?=
+ENCODED_WORD =\?{CHARSET}\?[bq]\?[^?\n]*\?=
 ENCODED_TOKEN ({TOKENFRONT}{TOKENMID})?({ENCODED_WORD}{WHITESPACE}+)*{ENCODED_WORD}
 /*
Given that the encoded word must not be continued on the next line,
this is safe.
Is this OK to commit for the nonce?
We can still switch to another fix later.
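
To illustrate what the changed character class buys us, here is a rough
sketch using Python's re module rather than flex. The CHARSET pattern and
the two sample strings are made up for the illustration (the real
{CHARSET} definition is not shown in this diff), and Python's backtracking
matcher is not the same thing as flex's generated scanner, so take it only
as a picture of which inputs each pattern accepts.

```python
import re

# Hypothetical stand-in for the lexer's {CHARSET} definition, which is
# not part of the diff above.
CHARSET = r"[\w.:-]+"

# Old pattern: the payload may contain anything except '?', including '\n'.
OLD = re.compile(r"=\?" + CHARSET + r"\?[bq]\?[^?]*\?=")
# New pattern: the payload stops at a newline as well.
NEW = re.compile(r"=\?" + CHARSET + r"\?[bq]\?[^?\n]*\?=")

# Made-up sample inputs.
well_formed = "Subject: =?iso-8859-1?q?hello_world?= rest"
unterminated = "Subject: =?iso-8859-1?q?unterminated\nbody text with ?= later"


def match_text(pattern, text):
    """Return the matched substring, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for name, pattern in (("old", OLD), ("new", NEW)):
        print(name, repr(match_text(pattern, well_formed)))
        print(name, repr(match_text(pattern, unterminated)))
```

With the old pattern, an unterminated encoded word keeps consuming input
across the line break until some later "?=" turns up; with the new one the
attempt fails at the newline, while a well-formed encoded word on a single
line is matched the same way by both.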
--
Matthias Andree