the RFC2047 problem
Matthias Andree
matthias.andree at gmx.de
Wed Dec 1 12:09:14 CET 2004
David Relson <relson at osagesoftware.com> writes:
>> How about the patch below? It is supposed to limit the search to 1,000
>> characters or the line feed, whichever is nearer. Please check if it
>> fixes the speed problem, I don't have Evgeniy's full message for
>> testing.
>
> Lexer repeat counts are very, very bad. They make the generated
> scanner very large.
Argh, wasn't aware of that.
Then let's just put \n inside the [^...] character class; that keeps
the scanner from looking beyond the end of the line.
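
To illustrate why the extra \n is enough, here is a minimal C sketch of
what a negated class containing a line feed matches. It is not
bogofilter code: the function name is made up, and '?' as the other
excluded character is only an assumption (it is the RFC2047
encoded-word delimiter); the real class in lexer_v3.l may differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the run that a class such as
 * [^?\n]* would match at the start of s. strcspn() stops at the
 * first '?' or '\n' (or at the terminating NUL), so the scan can
 * never run past the end of the current line. */
size_t span_within_line(const char *s)
{
    return strcspn(s, "?\n");
}
```

The scan is bounded by the line length without any {0,1000} repeat
count, so the generated tables should not grow.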
> Before:
>
> without {0,1000}:
> -rw-r--r-- 1 ... 124029 Nov 30 13:15 lexer_v3.c
> -rw-r--r-- 1 ... 12802 Nov 30 13:14 lexer_v3.l
> -rw-r--r-- 1 ... 74796 Nov 30 13:15 lexer_v3.o
>
> with {0,1000}:
> -rw-r--r-- 1 ... 2687019 Nov 30 20:41 lexer_v3.c
> -rw-r--r-- 1 ... 12809 Nov 30 20:41 lexer_v3.l
> -rw-r--r-- 1 ... 1647532 Nov 30 20:41 lexer_v3.o
Holy cow.
> I wonder if there's a reasonable way to unfold header lines and limit
> their parsing to the single, unfolded line?
Good plan. Let's make that "unfold and RFC2047-decode header lines";
that also makes text_decode run only once per token (rather than
recursively).
>> We'll need to review and possibly fix t.multiple-wordlists, too. The
>> removal of the max{} fixup and the fix of calc_prob to fall back to
>> robx change the rstats table output. Unfortunately, the output of
>> printf when it sees "not a number" is not specified, so we cannot do
>> a 1:1 comparison - one system writes nan, one NaN, and some may
>> include (-0x...........).
>
> Please, do not change any test results until the calculations are
> finalized.
OK. I'd rather revise the whole test, construct a particular test case,
and check the result without looking at rstats.
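
For such a test, a comparison on the numeric value avoids the nan/NaN
spelling issue altogether. A minimal sketch, with a made-up helper name
and a tolerance parameter that is purely an assumption:

```c
#include <math.h>

/* Hypothetical test helper: returns 1 if got and want are both NaN,
 * or if neither is NaN and they differ by at most eps. It never
 * looks at how printf would spell "not a number". */
int prob_matches(double got, double want, double eps)
{
    double d;

    if (isnan(got) || isnan(want))
        return isnan(got) && isnan(want);

    d = got - want;
    if (d < 0)
        d = -d;
    return d <= eps;
}
```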
--
Matthias Andree