the RFC2047 problem
Matthias Andree
matthias.andree at gmx.de
Wed Dec 1 02:37:37 CET 2004
On Tue, 30 Nov 2004, David Relson wrote:
> Evgeny's problem is a lexer problem. The "Content-Disposition:" has two
> encoded tokens, with the first being correctly formed and the second
> lacking the required "?=" termination. The lexer is trying to match the
> improperly formed token with the contents of the message. In this
> effort, the rest of the file is read into memory, consuming time and
> memory.
How about the patch below? It is supposed to limit the search to 1,000
characters or the next line feed, whichever comes first. Please check
whether it fixes the speed problem; I don't have Evgeniy's full message
for testing.
If this fixes the performance issue, we should just update the "expect"
data in the self-test - it's actually a good thing if a nonconformant
encoded word leads to different lexer output than a conformant one -
this way, we'll be able to train on bugs in spamware.
New test output with the patch given below:
12,14c12
< mime:goo
< mime:Windows-1251
< mime:fTu
---
> mime:goo���
FAIL: t.rfc2047_broken
Index: lexer_v3.l
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/lexer_v3.l,v
retrieving revision 1.154
diff -u -r1.154 lexer_v3.l
--- lexer_v3.l 23 Nov 2004 04:28:02 -0000 1.154
+++ lexer_v3.l 1 Dec 2004 00:57:08 -0000
@@ -185,7 +185,7 @@
HTML_ENCODING "&#"x?[[:xdigit:]]+";"
URL_ENCODING "%"[[:xdigit:]][[:xdigit:]]
-ENCODED_WORD =\?{CHARSET}\?[bq]\?[^?]*\?=
+ENCODED_WORD =\?{CHARSET}\?[bq]\?[^?\n]{0,1000}\?=
ENCODED_TOKEN ({TOKENFRONT}{TOKENMID})?({ENCODED_WORD}{WHITESPACE}+)*{ENCODED_WORD}
/*
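To illustrate the effect of the changed rule outside of flex, here is a rough Python sketch of the two ENCODED_WORD patterns. It is only an approximation: the CHARSET pattern is a simplified stand-in for the lexer's real {CHARSET} definition, the helper name match_length is made up for this example, and Python's re engine does not buffer input the way a flex scanner does. It only shows how far each pattern is willing to scan.

```python
import re

# Assumption: simplified stand-in for the lexer's {CHARSET} definition.
CHARSET = r"[A-Za-z0-9_.:-]+"

# Old rule: the payload may be anything except '?', including line feeds.
OLD_ENCODED_WORD = re.compile(r"=\?" + CHARSET + r"\?[bq]\?[^?]*\?=")

# Patched rule: the payload stops at a line feed and is capped at 1000 chars.
NEW_ENCODED_WORD = re.compile(r"=\?" + CHARSET + r"\?[bq]\?[^?\n]{0,1000}\?=")


def match_length(pattern, text):
    """Return the length of the first match, or 0 if nothing matches."""
    m = pattern.search(text)
    return len(m.group(0)) if m else 0


# A header with an unterminated encoded word, followed by a long body
# that happens to contain "?=" much later.
broken = 'filename="=?windows-1251?b?Zm9v\n' + "body line\n" * 1000 + "?=\n"

# A properly terminated encoded word.
good = "=?windows-1251?b?Zm9v?="

if __name__ == "__main__":
    print("old rule, broken input:", match_length(OLD_ENCODED_WORD, broken))
    print("new rule, broken input:", match_length(NEW_ENCODED_WORD, broken))
    print("new rule, good input:  ", match_length(NEW_ENCODED_WORD, good))
```

With the old pattern the unterminated word swallows the whole body up to the stray "?=", while the patched pattern gives up at the first line feed and the well-formed word still matches unchanged.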
> I've expanded t.rfc2047_broken to include this problem. File
Thanks.
> ${TMPDIR}/output.2a is bogolexer's output using the improperly formed
> token and ${TMPDIR}/output.2b is output using a properly formed token.
> Since these two outputs are different, the test _does_ FAIL (during make
> check).
Good!
> When we get this fixed, the test will PASS. If we decide to release
That depends on how we fix it. I'd rather recognize a "broken" encoded
word for what it is: garbage. We needn't (shouldn't) get the same output
as from a proper encoded word, as outlined above.
> 0.93.2 before fixing this problem, we can comment out the test's final
> "diff" so that "make check" can pass its tests.
We'll need to review and possibly fix t.multiple-wordlists, too. The
removal of the max{} fixup and the fix of calc_prob to fall back to
robx change the rstats table output. Unfortunately, the output of
printf when it sees "not a number" is not specified, so we cannot do a
1:1 comparison: one system writes nan, another NaN, and some may
include (-0x...........).
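One possible way around the varying NaN spellings would be to normalize them before comparing. The following Python sketch is only an illustration of that idea, not something the bogofilter test suite does; the function name normalize_nan and the sample lines are made up for the example.

```python
import re

# Matches a standalone NaN token in any letter case, with an optional sign
# and an optional parenthesized suffix, e.g. "nan", "NaN", "-nan(0x8000)".
NAN_SPELLING = re.compile(
    r"(?<![A-Za-z0-9])[-+]?nan(?:\([^)]*\))?(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def normalize_nan(line):
    """Rewrite every NaN spelling in the line to the plain token 'nan'."""
    return NAN_SPELLING.sub("nan", line)


def same_output(expected_lines, actual_lines):
    """Compare two outputs line by line after NaN normalization."""
    return [normalize_nan(l) for l in expected_lines] == [
        normalize_nan(l) for l in actual_lines
    ]
```

Two outputs that differ only in how the platform prints NaN would then compare equal, while any other difference would still be reported.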
Kind regards,
--
Matthias Andree