the RFC2047 problem

Wed Dec 1 02:37:37 CET 2004

On Tue, 30 Nov 2004, David Relson wrote:

> Evgeny's problem is a lexer problem.  The "Content-Disposition:" has two
> encoded tokens, with the first being correctly formed and the second
> lacking the required "?=" termination.  The lexer is trying to match the
> improperly formed token with the contents of the message.  In this
> effort, the rest of the file is read into memory, consuming time and
> memory.

How about the patch below? It is supposed to limit the search to 1,000
characters or the next line feed, whichever comes first. Please check
whether it fixes the speed problem; I don't have Evgeny's full message
for testing.
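
To illustrate what the bound is meant to achieve, here is a rough sketch
using Python's re module rather than flex. It is only an approximation:
flex builds a DFA and Python backtracks, and CHARSET below is a
hypothetical stand-in, since the real {CHARSET} definition from
lexer_v3.l is not shown here. The sample strings are made up as well.

```python
import re

# Hypothetical stand-in for the lexer's {CHARSET}; the real definition
# lives in lexer_v3.l and is not reproduced here.
CHARSET = r"[\w.-]+"

# Pattern before the patch: the payload may run across line feeds.
OLD = re.compile(r"=\?" + CHARSET + r"\?[bq]\?[^?]*\?=")
# Pattern after the patch: payload stops at a line feed or 1000 chars.
NEW = re.compile(r"=\?" + CHARSET + r"\?[bq]\?[^?\n]{0,1000}\?=")

# A well-formed encoded word, and one that lacks the closing "?=" on
# its own line but is followed, much later, by an unrelated "?=".
good = "=?windows-1251?b?Z29v?="
broken = "=?windows-1251?b?Z29v\n" + "body text\n" * 500 + "trailing ?= here"


def match_len(pattern, text):
    """Return the length of the match at the start of text, or 0."""
    m = pattern.match(text)
    return len(m.group(0)) if m else 0


if __name__ == "__main__":
    for name, pattern in (("old", OLD), ("new", NEW)):
        print(name, "good:", match_len(pattern, good),
              "broken:", match_len(pattern, broken))
```

With the old pattern the broken word swallows everything up to the
distant "?=", which is the kind of runaway scan described above; with
the new one it simply does not match as an encoded word.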

If this fixes the performance issue, we should just update the "expect"
data in the self-test - it's actually a good thing if a nonconformant
encoded word leads to different lexer output than a conformant one -
this way, we'll be able to train on bugs in spamware.

New test output with the patch given below:

12,14c12
< mime:goo
< mime:Windows-1251
< mime:fTu
---
> mime:goo���
FAIL: t.rfc2047_broken

Index: lexer_v3.l
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/lexer_v3.l,v
retrieving revision 1.154
diff -u -r1.154 lexer_v3.l

--- lexer_v3.l	23 Nov 2004 04:28:02 -0000	1.154
+++ lexer_v3.l	1 Dec 2004 00:57:08 -0000
@@ -185,7 +185,7 @@
 HTML_ENCODING	"&#"x?[[:xdigit:]]+";"
 URL_ENCODING	"%"[[:xdigit:]][[:xdigit:]]
 
-ENCODED_WORD	=\?{CHARSET}\?[bq]\?[^?]*\?=
+ENCODED_WORD	=\?{CHARSET}\?[bq]\?[^?\n]{0,1000}\?=
 ENCODED_TOKEN	({TOKENFRONT}{TOKENMID})?({ENCODED_WORD}{WHITESPACE}+)*{ENCODED_WORD}
 
 /*

> I've expanded t.rfc2047_broken to include this problem.  File

Thanks.

> ${TMPDIR}/output.2a is bogolexer's output using the improperly formed
> token and ${TMPDIR}/output.2b is output using a properly formed token.
> Since these two outputs are different, the test _does_ FAIL (during make
> check).

Good!

> When we get this fixed, the test will PASS.  If we decide to release

That depends on how we fix it. I'd rather recognize a "broken" encoded
word for what it is: garbage. We needn't (shouldn't) get the same output
as from a proper encoded word, as outlined above.

> 0.93.2 before fixing this problem, we can comment out the test's final
> "diff" so that "make check" can pass its tests.

We'll need to review and possibly fix t.multiple-wordlists, too. The
removal of the max{} fixup and the fix of calc_prob to fall back to
robx change the rstats table output, but unfortunately the output of
printf when it sees "not a number" is not fully specified, so we cannot
do a 1:1 comparison: one system writes nan, another NaN, and some may
include (-0x...........).
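
One possible way around the NaN spelling problem would be to normalize
the output before comparing it. The sketch below is only an idea, not
what the test suite does today; normalize_nan and the sample lines are
hypothetical, and it assumes the variants look like nan, NaN, or nan
followed by a parenthesized payload, optionally signed.

```python
import re

# Matches nan/NaN, an optional sign, and an optional "(...)" payload,
# but only when it stands alone (so words like "banana" are untouched).
NAN_SPELLING = re.compile(
    r"(?<![\w.])[+-]?nan(?:\([^)\s]*\))?(?![\w.])",
    re.IGNORECASE,
)


def normalize_nan(line):
    """Rewrite every NaN spelling in line to the single token 'nan'."""
    return NAN_SPELLING.sub("nan", line)


if __name__ == "__main__":
    # Made-up lines in the spirit of a table with numeric columns.
    for sample in ("token 1 NaN 0.5", "token 1 -nan(0x8000000000000) 0.5"):
        print(normalize_nan(sample))
```

Both reference and actual output would go through the same filter before
the diff, so only real differences remain.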

Kind regards,

-- 
Matthias Andree