Encoded filenames

Tue Nov 30 13:46:49 CET 2004

On Tue, 30 Nov 2004 12:57:55 +0100
Matthias Andree wrote:

> On Mon, 29 Nov 2004, David Relson wrote:
> 
> > You're going to love this.  In Evgeny's message, the filenames are
> > base64 Windows-1251 encoded, i.e. "=?Windows-1251?B?...", and are
> > quite long.  Simply changing them to "filename.txt" (or anything else
> > simple and short) changes the processing time for the message to
> > 0.02 sec (from 12+ sec).
> 
> There be RFC-2047 bugs. lexer_v3.l must not call text_decode(), but it
> does, and can hence recurse, leading to bogus results.
> 
> Anyways, it appears we're treating everything as text that isn't
> text/html or message/*.
> 
> I haven't got the test case here so I can't try. How's this patch?
> It's test neutral.
> 
...[snip]....

Matthias,

It doesn't help the speed.  With the patch, my time is 12.69, about the
same as before.

<SKIP>.		/* ignore character */
<SKIP>{TOKEN}	/* ignore tokens */
<SKIP>\n\n	{ BEGIN TEXT; }

With the addition of the <SKIP>{TOKEN} rule, the message returns a
reasonable number of tokens: 95 (instead of 10,863).
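
For readers less familiar with flex start conditions, the effect of the
three SKIP rules can be sketched in Python.  This is only an illustration:
the TOKEN pattern and the tokenize() helper are hypothetical and are not
bogofilter's, and the sketch assumes the SKIP state is entered for a part
whose content should not be tokenized.

```python
import re
from typing import List

# Hypothetical token pattern for illustration only; bogofilter's real
# TOKEN definition in lexer_v3.l is different.
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9'.-]*")


def tokenize(body: str, start_in_skip: bool) -> List[str]:
    """Return tokens from body, optionally discarding the leading part.

    When start_in_skip is true, everything up to and including the first
    blank line ("\n\n") is dropped, mirroring the <SKIP> rules that ignore
    characters and tokens until a blank line switches back to TEXT.
    """
    if start_in_skip:
        _skipped, separator, rest = body.partition("\n\n")
        # No blank line found: the whole input stays in the skip state.
        body = rest if separator else ""
    return TOKEN.findall(body)


if __name__ == "__main__":
    sample = "QUJDREVGR0g=\nSUpLTE1OT1A=\n\nhello world"
    print(tokenize(sample, start_in_skip=False))
    print(tokenize(sample, start_in_skip=True))
```

Without the skip, the encoded lines contribute tokens of their own; with
it, only the text after the blank line is tokenized.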

As an additional note, bogofilter's -vvv output is slightly damaged in
CVS: it's missing a token count.

As to RFC-2047, _something_ is needed to decode encoded words of the form
"=?charset?Q?...?=".  The current use of text_decode() for this looks fine
to me.  I don't see the recursion you've mentioned.  Can you send me a
test case?
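
To make the encoded-word format concrete, here is a minimal sketch of
RFC-2047 decoding using Python's standard email.header module.  It is not
bogofilter's text_decode(); the helper name decode_encoded_words() is
hypothetical and the sample strings are made up for illustration.

```python
from email.header import decode_header


def decode_encoded_words(value: str) -> str:
    """Decode any RFC-2047 encoded words in value and return plain text."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as bytes plus the declared charset;
            # fall back to ASCII when no charset is given.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # Input without encoded words is returned unchanged as str.
            parts.append(chunk)
    return "".join(parts)


if __name__ == "__main__":
    # Q encoding: "=XX" is a hex-escaped byte in the named charset.
    print(decode_encoded_words("=?Windows-1251?Q?=F4=E0=E9=EB?="))
    # B encoding: the payload is base64.
    print(decode_encoded_words("=?utf-8?B?aGVsbG8=?="))
    # A plain name passes through untouched.
    print(decode_encoded_words("filename.txt"))
```

Each encoded word is decoded exactly once here, so there is nothing to
recurse into; decoded output is never fed back to the decoder.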

Regards,

David

-- 
David Relson                   Osage Software Systems, Inc.
relson at osagesoftware.com       Ann Arbor, MI 48103
www.osagesoftware.com          tel:  734.821.8800