the RFC2047 problem

Matthias Andree matthias.andree at gmx.de
Wed Dec 1 12:09:14 CET 2004


David Relson <relson at osagesoftware.com> writes:

>> How about the patch below? It is supposed to limit the search to 1,000
>> characters or the line feed, whichever is nearer. Please check if it
>> fixes the speed problem, I don't have Evgeniy's full message for
>> testing.
>
> Lexer repeat counts are very, very bad.  They make the generated lexer very large.

Argh, wasn't aware of that.

Then let's just put \n inside the [^...] character class; that keeps the
scanner from looking beyond the end of the line.
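
A rough sketch of the effect, using Python's re module as a stand-in for
flex. The two patterns are simplified, hypothetical stand-ins for the
encoded-word rule, not the actual rule from lexer_v3.l. The relevant
behaviour is the same in both tools: a negated character class matches a
newline unless \n is listed in it.

```python
import re

# Hypothetical, simplified patterns; NOT the real lexer_v3.l rule.
# A negated class such as [^?] also matches "\n" unless "\n" is listed.
UNBOUNDED = re.compile(r"=\?[^?]*\?")     # may run across line ends
LINE_BOUND = re.compile(r"=\?[^?\n]*\?")  # stops at the end of the line

# An unterminated "=?" on the first line; the next "?" is two lines later.
text = "Subject: =?iso-8859-1\nX-Filler: aaaa\nX-Other: what?\n"

runaway = UNBOUNDED.search(text)
bounded = LINE_BOUND.search(text)

print("unbounded match spans lines:", "\n" in runaway.group())
print("line-bound match found:", bounded is not None)
```

With the newline excluded, the scan for the closing "?" gives up at the
end of the line instead of reading on through the rest of the message.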

> Before:
>
> without {0,1000}:
> -rw-r--r--  1 ... 124029 Nov 30 13:15 lexer_v3.c
> -rw-r--r--  1 ...  12802 Nov 30 13:14 lexer_v3.l
> -rw-r--r--  1 ...  74796 Nov 30 13:15 lexer_v3.o
>
> with {0,1000}:
> -rw-r--r--  1 ... 2687019 Nov 30 20:41 lexer_v3.c
> -rw-r--r--  1 ...   12809 Nov 30 20:41 lexer_v3.l
> -rw-r--r--  1 ... 1647532 Nov 30 20:41 lexer_v3.o

Holy cow.

> I wonder if there's a reasonable way to unfold header lines and limit
> their parsing to the single, unfolded line?

Good plan. Let's make that "unfold and RFC2047-decode header lines";
that also makes text_decode run only once per token, rather than
recursively.
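
To make the idea concrete, here is a minimal sketch in Python rather than
in bogofilter's C. The helper names unfold and decode_rfc2047 are
hypothetical, and decode_header comes from Python's standard email.header
module; none of this is bogofilter code. It assumes the charset named in
each encoded word is one Python knows.

```python
import re
from email.header import decode_header


def unfold(raw_header: str) -> str:
    """Join continuation lines: drop each line break followed by space/tab."""
    return re.sub(r"\r?\n(?=[ \t])", "", raw_header).rstrip("\r\n")


def decode_rfc2047(line: str) -> str:
    """Decode the RFC 2047 encoded words of one unfolded header, in one pass."""
    parts = []
    for chunk, charset in decode_header(line):
        if isinstance(chunk, bytes):
            # Assumption: unknown bytes are replaced rather than rejected.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


# A folded header: the second physical line starts with a space.
folded = "Subject: =?iso-8859-1?Q?Gr=FC=DFe?=\n aus Dortmund\n"
decoded = decode_rfc2047(unfold(folded))
print(decoded)
```

The tokenizer would then see a single decoded line per header and would
not need to decode again for each token.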

>> We'll need to review and possibly fix t.multiple-wordlists, too. The
>> removal of the max{} fixup and the fix of calc_prob to fall back to
>> robx change the rstats table output format, but unfortunately the
>> output of printf when it sees "not a number" is not specified, so we
>> cannot do a 1:1 comparison: one system writes nan, one NaN, and some
>> may include (-0x...........).
>
> Please, do not change any test results until the calculations are
> finalized.

OK. I'd rather revise the whole test, construct a particular test case,
and check the result without looking at rstats.
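
If a textual comparison of rstats output were ever needed after all, the
NaN spellings could be normalized before comparing. A minimal sketch in
Python, assuming whitespace-separated columns; the sample rows in the
usage lines are made up for illustration and are not real rstats output.

```python
import re

# Spellings printf may emit for NaN: nan, NaN, -nan, nan(0x...), and so on.
NAN_TOKEN = re.compile(r"^[-+]?nan(\([^)]*\))?$", re.IGNORECASE)


def normalize_nan(line: str) -> str:
    """Rewrite every NaN spelling in a whitespace-separated line as 'nan'."""
    return " ".join(
        "nan" if NAN_TOKEN.match(tok) else tok for tok in line.split()
    )


# Made-up sample rows, one per platform style.
print(normalize_nan("word 1 0 NaN"))
print(normalize_nan("word 1 0 -nan(0x8000000000000)"))
```

Note that this also collapses runs of whitespace, and that a token which
is literally the word "nan" would be treated as a number spelling.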

-- 
Matthias Andree


