the RFC2047 problem

David Relson relson at osagesoftware.com
Wed Dec 1 02:47:31 CET 2004


On Wed, 1 Dec 2004 02:37:37 +0100
Matthias Andree wrote:

> On Tue, 30 Nov 2004, David Relson wrote:
> 
> > Evgeny's problem is a lexer problem.  The "Content-Disposition:" has
> > two encoded tokens, with the first being correctly formed and the
> > second lacking the required "?=" termination.  The lexer is trying
> > to match the improperly formed token with the contents of the
> > message.  In this effort, the rest of the file is read into memory,
> > consuming time and memory.
> 
> How about the patch below? It is supposed to limit the search to 1,000
> characters or the line feed, whichever is nearer. Please check if it
> fixes the speed problem, I don't have Evgeniy's full message for
> testing.

Lexer repeat counts are very, very bad.  They make the generated lexer
very large.  File sizes before and after the patch:

without {0,1000}:
-rw-r--r--  1 ... 124029 Nov 30 13:15 lexer_v3.c
-rw-r--r--  1 ...  12802 Nov 30 13:14 lexer_v3.l
-rw-r--r--  1 ...  74796 Nov 30 13:15 lexer_v3.o

with {0,1000}:
-rw-r--r--  1 ... 2687019 Nov 30 20:41 lexer_v3.c
-rw-r--r--  1 ...   12809 Nov 30 20:41 lexer_v3.l
-rw-r--r--  1 ... 1647532 Nov 30 20:41 lexer_v3.o
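
The same bound could in principle be enforced in hand-written C rather
than as a repeat count in the pattern, so the generated tables do not
grow.  The following is only a sketch, not bogofilter code: the name
find_encoded_word_end() and the MAX_SCAN constant are hypothetical, and
it assumes the caller has already skipped the "=?charset?encoding?"
prefix so that s points at the encoded text.

```c
#include <stddef.h>

#define MAX_SCAN 1000   /* hypothetical limit, mirrors the {0,1000} above */

/* Look for the "?=" that terminates an encoded word.  s points at the
 * encoded text and len is the number of bytes available there.
 * Returns a pointer to the '?' of the terminator, or NULL if a line
 * break or the MAX_SCAN limit is reached first. */
static const char *find_encoded_word_end(const char *s, size_t len)
{
    size_t limit = len < MAX_SCAN ? len : MAX_SCAN;
    size_t i;

    for (i = 0; i + 1 < limit; i++) {
        if (s[i] == '\n' || s[i] == '\r')
            return NULL;        /* end of line first: malformed word */
        if (s[i] == '?' && s[i + 1] == '=')
            return s + i;       /* properly terminated */
    }
    return NULL;                /* no terminator within the limit */
}
```

A NULL result would let the caller treat the text as an ordinary token
instead of reading ahead through the rest of the message.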


> If this fixes the performance issue, we should just update the
> "expect" data in the self-test - it's actually a good thing if a
> nonconformant encoded word leads to different lexer output than a
> conformant one - this way, we'll be able to train on bugs in spamware.
> 
> New test output with the patch given below:
> 
> 12,14c12
> < mime:goo
> < mime:Windows-1251
> < mime:fTu
> ---
> > mime:goo_________
> FAIL: t.rfc2047_broken

Results look better, though the change in executable size is a
showstopper.

I wonder if there's a reasonable way to unfold header lines and limit
their parsing to the single, unfolded line?
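
A minimal sketch of what such unfolding might look like, assuming the
header is available in a memory buffer before it reaches the lexer.
unfold_header() is a hypothetical helper, not an existing bogofilter
function.  It follows the usual mail convention that a line break
followed by a space or tab continues the same header field.

```c
#include <stddef.h>

/* Copy one header field from src (srclen bytes) into dst, dropping
 * each line break that is followed by a space or tab (a folded
 * continuation line).  Stops at the first line break that is not a
 * fold.  Output is truncated if dst is too small, and dst is always
 * NUL-terminated (dstsize must be at least 1).  Returns the number of
 * bytes consumed from src, including the terminating line break. */
static size_t unfold_header(const char *src, size_t srclen,
                            char *dst, size_t dstsize)
{
    size_t in = 0, out = 0;

    while (in < srclen) {
        char c = src[in];

        if (c == '\r' && in + 1 < srclen && src[in + 1] == '\n') {
            in++;               /* treat CRLF like a bare LF */
            c = '\n';
        }
        if (c == '\n') {
            in++;               /* consume the line break */
            if (in < srclen && (src[in] == ' ' || src[in] == '\t'))
                continue;       /* fold: drop the break, keep the blank */
            break;              /* end of this header field */
        }
        if (out + 1 < dstsize)
            dst[out++] = c;
        in++;
    }
    dst[out] = '\0';
    return in;
}
```

With the field isolated in dst, a malformed encoded word could not make
the scanner read past the end of that one header.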

> > I've expanded t.rfc2047_broken to include this problem.  File
> 
> Thanks.
> 
> > ${TMPDIR}/output.2a is bogolexer's output using the improperly
> > formed token and ${TMPDIR}/output.2b is output using a properly
> > formed token. Since these two outputs are different, the test _does_
> > FAIL (during make check).
> 
> Good!
> 
> > When we get this fixed, the test will PASS.  If we decide to release
> 
> That depends on how we fix. I'd rather recognize a "broken" encoded
> word for what it is, garbage. We needn't (shouldn't) get the same
> output as from a proper encoded word, as outlined above.
> 
> > 0.93.2 before fixing this problem, we can comment out the test's
> > final "diff" so that "make check" can pass its tests.
> 
> We'll need to review and possibly fix t.multiple-wordlists, too. The
> removal of the max{} fixup and fix of calc_prob to fall back to
> robx changes the rstats table format output, but unfortunately the
> output of printf when it sees "not a number" is not specified so we
> cannot do a 1:1 comparison - one system writes nan, one NaN, and some
> may include (-0x...........).

Please, do not change any test results until the calculations are
finalized.
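
Once the calculations are settled, the platform difference in how
printf writes "not a number" could be avoided by formatting that case
explicitly.  This is a sketch only: format_prob() and the "%8.6f"
format are hypothetical and are not taken from the rstats code.

```c
#include <math.h>
#include <stdio.h>

/* Format a probability for test output.  NaN is written as the fixed
 * string "nan" rather than whatever this platform's printf produces;
 * all other values use a fixed-width decimal format. */
static int format_prob(char *buf, size_t size, double p)
{
    if (isnan(p))
        return snprintf(buf, size, "nan");
    return snprintf(buf, size, "%8.6f", p);
}
```

Output produced this way could be compared 1:1 across systems.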


