Encoded filenames
David Relson
relson at osagesoftware.com
Tue Nov 30 13:46:49 CET 2004
On Tue, 30 Nov 2004 12:57:55 +0100
Matthias Andree wrote:
> On Mon, 29 Nov 2004, David Relson wrote:
>
> > You're going to love this. In Evgeny's message, the filenames are
> > base64 Windows-1251 encoded, i.e. "=?Windows-1251?B?...", and are
> > quite long. Simply changing them to "filename.txt" (or anything else
> > simple and short) changes the processing time for the message to
> > 0.02 sec (from 12+ sec).
>
> There be RFC-2047 bugs. lexer_v3.l must not call text_decode(), but it
> does, and can hence recurse, leading to bogus results.
>
> Anyways, it appears we're treating everything as text that isn't
> text/html or message/*.
>
> I haven't got the test case here so I can't try. How's this patch?
> It's test neutral.
>
...[snip]....
Matthias,
It doesn't help the speed. With the patch, my time is 12.69 sec, about
the same as before.
<SKIP>.          /* ignore character */
<SKIP>{TOKEN}    /* ignore tokens */
<SKIP>\n\n       { BEGIN TEXT; }
With the addition of the <SKIP>{TOKEN} rule, the message yields a
reasonable number of tokens: 95 (instead of 10,863).
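For anyone who wants to play with the idea outside of flex, here is a rough Python model of the three SKIP rules. It is only a sketch under assumptions: the TOKEN pattern is a simple stand-in (the real definition in lexer_v3.l is more involved), the function and flag names are made up for illustration, and the "without the rule" branch reflects my guess that a general {TOKEN} rule wins by longest match when SKIP has no {TOKEN} rule of its own.

```python
import re

# Assumption: a simplified stand-in for the lexer's TOKEN definition.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def tokenize(data, skip_has_token_rule=True, state="SKIP"):
    """Model a scanner with a SKIP state that ends at a blank line.

    In SKIP, single characters are dropped and "\n\n" switches to TEXT.
    If skip_has_token_rule is False, token-shaped runs inside SKIP are
    emitted anyway, mimicking a general token rule that matches a longer
    string than the one-character SKIP rule.
    """
    tokens = []
    pos = 0
    while pos < len(data):
        if state == "SKIP" and data.startswith("\n\n", pos):
            state = "TEXT"          # blank line: leave the skipped region
            pos += 2
            continue
        match = TOKEN.match(data, pos)
        if match is None:
            pos += 1                # not token text: drop one character
            continue
        if state == "TEXT" or not skip_has_token_rule:
            tokens.append(match.group())
        pos = match.end()           # consume the whole token either way
    return tokens


if __name__ == "__main__":
    sample = "8uXx8i50eHQ AAAA\nBBBB\n\nhello world\n"
    print(len(tokenize(sample, skip_has_token_rule=True)))
    print(len(tokenize(sample, skip_has_token_rule=False)))
```

The point of the model is only that, in a skipped region, token-shaped runs have to be swallowed explicitly; otherwise every chunk of a long encoded blob can come out as a token.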
As an additional note, bogofilter's -vvv output is slightly damaged in
CVS. It's missing a token count.
As to RFC-2047, _something_ is needed to decode encoded words like
"=?charset?Q?...?=". The current use of text_decode() for this looks
fine to me. I don't see the recursion you've mentioned. Can you send
me a test case?
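To make the "something is needed" point concrete, here is a small sketch of RFC 2047 decoding using Python's standard email.header module. It has nothing to do with how text_decode() is implemented; the helper name decode_encoded_words and the sample filename are made up for illustration, and the sample is not taken from Evgeny's message.

```python
import base64
from email.header import decode_header


def decode_encoded_words(value):
    """Decode any RFC 2047 encoded words in value and return plain text.

    Unknown charsets are not handled here; bytes.decode() raises
    LookupError for them.
    """
    parts = []
    for raw, charset in decode_header(value):
        if isinstance(raw, bytes):
            # Encoded words (and text next to them) come back as bytes.
            parts.append(raw.decode(charset or "ascii", errors="replace"))
        else:
            # A value with no encoded words comes back as str unchanged.
            parts.append(raw)
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical sample: a Cyrillic filename, base64 Windows-1251 encoded.
    name = "\u0442\u0435\u0441\u0442.txt"
    payload = base64.b64encode(name.encode("cp1251")).decode("ascii")
    word = "=?Windows-1251?B?" + payload + "?="
    print(word)
    print(decode_encoded_words(word))
```

The decoded result is a single short string, which is what a tokenizer would want to see instead of the raw base64 run.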
Regards,
David
--
David Relson Osage Software Systems, Inc.
relson at osagesoftware.com Ann Arbor, MI 48103
www.osagesoftware.com tel: 734.821.8800