the RFC2047 problem
David Relson
relson at osagesoftware.com
Wed Dec 1 02:47:31 CET 2004
On Wed, 1 Dec 2004 02:37:37 +0100
Matthias Andree wrote:
> On Tue, 30 Nov 2004, David Relson wrote:
>
> > Evgeny's problem is a lexer problem. The "Content-Disposition:" has
> > two encoded tokens, with the first being correctly formed and the
> > second lacking the required "?=" termination. The lexer is trying
> > to match the improperly formed token with the contents of the
> > message. In this effort, the rest of the file is read into memory,
> > consuming time and memory.
>
> How about the patch below? It is supposed to limit the search to 1,000
> characters or the line feed, whichever is nearer. Please check if it
> fixes the speed problem, I don't have Evgeniy's full message for
> testing.
Lexer repeat counts are very, very bad. They make the generated lexer
very large.
Before (without {0,1000}):
-rw-r--r-- 1 ... 124029 Nov 30 13:15 lexer_v3.c
-rw-r--r-- 1 ... 12802 Nov 30 13:14 lexer_v3.l
-rw-r--r-- 1 ... 74796 Nov 30 13:15 lexer_v3.o
After (with {0,1000}):
-rw-r--r-- 1 ... 2687019 Nov 30 20:41 lexer_v3.c
-rw-r--r-- 1 ... 12809 Nov 30 20:41 lexer_v3.l
-rw-r--r-- 1 ... 1647532 Nov 30 20:41 lexer_v3.o
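
For illustration only: the limit Matthias describes (1,000 characters or
the line feed, whichever is nearer) can also be written as a plain C loop
with a counter, which costs one integer rather than extra lexer states.
This is a minimal sketch, not the patch under discussion and not
bogofilter code. The name find_encoded_word_end and the constant
MAX_ENCODED_SCAN are made up for the example, and it assumes the caller
has already consumed the "=?charset?encoding?" prefix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit, mirroring the "1,000 characters" in the quoted text. */
#define MAX_ENCODED_SCAN 1000

/*
 * Look for the "?=" that closes an RFC 2047 encoded word.
 * `s` points just past the opening "=?charset?encoding?" prefix.
 * Returns the offset of the '?' in "?=", or -1 if a line feed, the end
 * of the string, or the scan limit is reached first.
 */
long find_encoded_word_end(const char *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < MAX_ENCODED_SCAN && s[i] != '\0' && s[i] != '\n'; i++) {
        /* s[i] is not NUL here, so reading s[i + 1] stays in bounds. */
        if (s[i] == '?' && s[i + 1] == '=')
            return (long)i;
    }
    return -1;   /* unterminated: treat the token as garbage */
}
```

A caller would treat -1 as "broken encoded word" and stop there instead
of reading on into the message body.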
> If this fixes the performance issue, we should just update the
> "expect" data in the self-test - it's actually a good thing if a
> nonconformant encoded word leads to different lexer output than a
> conformant one - this way, we'll be able to train on bugs in spamware.
>
> New test output with the patch given below:
>
> 12,14c12
> < mime:goo
> < mime:Windows-1251
> < mime:fTu
> ---
> > mime:goo_________
> FAIL: t.rfc2047_broken
Results look better, though the change in executable size is a show
stopper.
I wonder if there's a reasonable way to unfold header lines and limit
their parsing to the single, unfolded line?
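
To make the question concrete, here is a rough sketch of what unfolding
could look like, assuming the usual rule that a line break followed by a
space or tab continues the header field. The function name unfold_header
and its interface are hypothetical; nothing like it is claimed to exist
in bogofilter, and how it would be wired into the lexer input is left
open.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy one header field from `src` into `dst`, unfolding it: a line break
 * followed by a space or tab is a continuation, so the break is dropped
 * and the whitespace is kept. Copying stops at the first line break that
 * does not introduce a continuation, or at the end of `src`.
 * Output is truncated to fit `dstsize` and is always NUL-terminated.
 * Returns the number of bytes of `src` consumed, including the final break.
 */
size_t unfold_header(const char *src, char *dst, size_t dstsize)
{
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL && dstsize > 0);
    while (src[in] != '\0') {
        size_t brk = 0;                 /* length of a line break at `in` */

        if (src[in] == '\r' && src[in + 1] == '\n')
            brk = 2;
        else if (src[in] == '\n')
            brk = 1;

        if (brk != 0) {
            char next = src[in + brk];

            in += brk;
            if (next == ' ' || next == '\t')
                continue;               /* folded line: keep going */
            break;                      /* end of this header field */
        }
        if (out + 1 < dstsize)          /* leave room for the NUL */
            dst[out++] = src[in];
        in++;
    }
    dst[out] = '\0';
    return in;
}
```

With the field in a single buffer, a scan for the closing "?=" can stop
at the end of that buffer and never reach the message body.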
> > I've expanded t.rfc2047_broken to include this problem. File
>
> Thanks.
>
> > ${TMPDIR}/output.2a is bogolexer's output using the improperly
> > formed token and ${TMPDIR}/output.2b is output using a properly
> > formed token. Since these two outputs are different, the test _does_
> > FAIL (during make check).
>
> Good!
>
> > When we get this fixed, the test will PASS. If we decide to release
>
> That depends on how we fix it. I'd rather recognize a "broken" encoded
> word for what it is, garbage. We needn't (shouldn't) get the same
> output as from a proper encoded word, as outlined above.
>
> > 0.93.2 before fixing this problem, we can comment out the test's
> > final "diff" so that "make check" can pass its tests.
>
> We'll need to review and possibly fix t.multiple-wordlists, too. The
> removal of the max{} fixup and fix of calc_prob to fall back to
> robx changes the rstats table format output, but unfortunately the
> output of printf when it sees "not a number" is not specified so we
> cannot do a 1:1 comparison - one system writes nan, one NaN, and some
> may include (-0x...........).
Please, do not change any test results until the calculations are
finalized.
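
On the "not a number" output quoted above: one way to make the text
comparable across systems, once the calculations are settled, would be to
test for NaN explicitly and print a fixed spelling. The sketch below
assumes a C99 <math.h> with isnan(). The function name format_prob and
the "%8.6f" format are placeholders chosen for the example, not the
actual rstats table format.

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>

/*
 * Format `value` into `buf`, using one fixed spelling for NaN so the
 * result does not depend on how the local printf writes it.
 * Returns whatever snprintf returns.
 */
int format_prob(char *buf, size_t size, double value)
{
    assert(buf != NULL && size > 0);
    if (isnan(value))
        return snprintf(buf, size, "%s", "nan");   /* always lowercase */
    return snprintf(buf, size, "%8.6f", value);    /* placeholder format */
}
```

The test's "expect" data would then hold the fixed spelling and a plain
diff would be enough.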