Encoded headers parser

Sat Jul 26 21:39:19 CEST 2003

On Fri, Jul 25, 2003 at 03:59:13PM -0400, David Relson wrote:
| Junior,
| 
| Currently, bogofilter has the following in lexer_v3.l:
| 
| BASE64          [0-9a-zA-Z/+=]+
| QP              [^[:blank:]]+
| ENCODED_TOKEN   {BOGOLEX_TOKEN}*=\?{ID}\?(b\?{BASE64}|q\?{QP})\?\=
| 
| It's more specific than what you're running and should do better.  Let me 
| know how it goes.

It seems to be correct, but the <INITIAL>{ENCODED_TOKEN} rule in the lexer
*is* running in the message body; I checked by compiling with flex -d. I
don't know much about these things, but the problem may be in the logic,
and not in the rules.
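
For anyone who wants to play with the pattern outside of flex, here is a
rough Python approximation of the definitions quoted above. It is only a
sketch: {ID} and {BOGOLEX_TOKEN} are not shown in this thread, so the ID
class below is a guess and the BOGOLEX_TOKEN prefix is left out. The
case-insensitive flag is also an assumption, made because the lexer log
shows "?Q?" being accepted although the rule is written with "q\?".

```python
import re

# Rough Python translation of the flex definitions quoted in this thread.
# ID is NOT shown in the thread; the character class below is a guess.
ID = r"[A-Za-z0-9_.\-]+"      # assumed: a charset name such as iso-8859-1
BASE64 = r"[0-9a-zA-Z/+=]+"   # as quoted
QP = r"[^ \t]+"               # [^[:blank:]]+ : anything except space or tab

# The optional {BOGOLEX_TOKEN}* prefix is omitted because its definition
# is not quoted either.
ENCODED_TOKEN = re.compile(
    rf"=\?{ID}\?(?:b\?{BASE64}|q\?{QP})\?=",
    re.IGNORECASE,  # assumed, since the log shows "?Q?" matching
)

value = "=?iso-8859-1?Q?RESOLU=C7=C3O-P1-CCO06.doc?="
match = ENCODED_TOKEN.fullmatch(value)
print(bool(match))
```

So the rule itself does accept the attachment name from the failing mail,
which fits with the pattern looking correct on its own.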

| 
| At 12:45 PM 7/25/03, Junior wrote:
| 
| ...[snip]...
| 
| >Only this chunk won't break, but sorry, I can't send the emails, because
| >they contain proprietary material. But it would be easy to reproduce: just
| >ask someone to send an Outlook mail with a (big) attachment whose
| >filename has an accented char.
| 
| I don't have ready access to Outlook, so I'll await your test results.
| 

It's breaking :(

A snippet of the lexer output:

	--accepting rule at line 210 ("name="")
	--(end of buffer or a NUL)
	--accepting rule at line 189 ("=?iso-8859-1?Q?RESOLU=C7=C3O-P1-CCO06.doc?=")
	--accepting rule at line 257 ("RESOLUÇÃO-P1-CCO06.doc")
	--accepting rule at line 262 (""")
	--accepting rule at line 263 ("
	")
	--accepting rule at line 197 ("Content-Transfer-Encoding: base64")
	--accepting rule at line 262 (" ")
	--accepting rule at line 257 ("base64")
	--accepting rule at line 263 ("
	")
	--(end of buffer or a NUL)
	--accepting rule at line 199 ("Content-Disposition: attachment")
	--accepting rule at line 262 (" ")
	--accepting rule at line 257 ("attachment")
	--accepting rule at line 262 (";")
	--accepting rule at line 263 ("
	")
	--(end of buffer or a NUL)
	--accepting rule at line 262 (" ")
	--accepting rule at line 211 ("filename="")
	--(end of buffer or a NUL)
	--(end of buffer or a NUL)
	--(end of buffer or a NUL)
	--(end of buffer or a NUL)
	(and these "end of buffer" lines repeat in a loop until bogofilter crashes)
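
The log shows rule 189 accepting the encoded word and rule 257 then
accepting the decoded name. The same decoding can be checked independently
with Python's standard email.header module. This is only a cross-check of
the expected result, not bogofilter's own decoder.

```python
from email.header import decode_header

encoded = "=?iso-8859-1?Q?RESOLU=C7=C3O-P1-CCO06.doc?="

# decode_header returns a list of (data, charset) pairs; data is bytes
# for an encoded word and charset is None for unencoded parts.
parts = decode_header(encoded)
decoded = "".join(
    raw.decode(charset or "ascii") if isinstance(raw, bytes) else raw
    for raw, charset in parts
)
print(decoded)
```

The result should be the same name that rule 257 reports in the log.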

The chunk of the email, for you to verify:

	------=_NextPart_000_0005_01C20B05.A109EF00
	Content-Type: application/msword;
			name="=?iso-8859-1?Q?RESOLU=C7=C3O-P1-CCO06.doc?="
	Content-Transfer-Encoding: base64
	Content-Disposition: attachment;
			filename="=?iso-8859-1?Q?RESOLU=C7=C3O-P1-CCO06.doc?="

	0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAABAAAAYgAAAAAAAAAA
	EAAAZAAAAAEAAAD+////AAAAAGEAAAD/////////////////////////////////////////////
	////////////////////////////////////////////////////////////////////////////

When the MIME part is smaller, bogofilter continues to parse correctly, but
when it is bigger, it crashes. I suspect that the problem is in the
filename line: it tries to decode the encoded word and runs on into the
MIME body too. If I put a space before the last ?=, it works OK!
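
A small illustration of that suspicion, not a confirmed diagnosis. The QP
definition [^[:blank:]]+ excludes only space and tab, so it can also match
newlines and keep going into the base64 body, which contains no blanks.
The Python sketch below compares that class with a stricter one; the
stricter class is purely hypothetical and is not taken from bogofilter.

```python
import re

QP_AS_QUOTED = re.compile(r"[^ \t]+")      # [^[:blank:]]+ as quoted in the thread
QP_STRICTER = re.compile(r"[^ \t\r\n?]+")  # hypothetical alternative

# Text that follows "?Q?" in the failing mail, up to the first body line.
text = (
    'RESOLU=C7=C3O-P1-CCO06.doc?="\n'
    "\n"
    "0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAABAAAAYgAAAAAAAAAA\n"
)

greedy = QP_AS_QUOTED.match(text).group(0)
strict = QP_STRICTER.match(text).group(0)

print(len(greedy) == len(text))  # the quoted class swallows the body line too
print(strict)                    # the stricter class stops before "?="
```

This would also be consistent with the space workaround: a blank before the
last ?= stops the [^[:blank:]]+ match right there. RFC 2047 does not allow
a literal "?" inside the encoded text, so excluding it should be safe, but
whether this is really what drives the "end of buffer" loop is only a guess.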

The "mailer" used to send this email (I suspect that it is widely used :)

	X-Mailer: Microsoft Outlook Express 5.00.2919.6700
	X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
	MIME-Version: 1.0

Hope this helps. And good luck with this great work; support for decoded
tokens is the feature I was waiting for most, because here in Brazil
most of the headers have text with accents and come encoded in QP or
B64. It should improve the accuracy a lot.

Your's best! (or, sorry, I don't know much English and I need to learn
some ways to finish my emails without them seeming cold or
impolite :D )

| >Anyway, this type of decoding would be nicer only in the headers,
| >because the encoded fields are almost always From, Sender, Reply-To,
| >Subject (fields that carry 8-bit chars, like the ones that contain
| >names, etc). It should speed up the parsing, and these encodings are
| >rarely found in the body (or am I wrong?).
| 
| Using <INITIAL> in the patterns restricts it to headers.
| 
| >After running flex, I got a 3MB lexer_v3.c, is that correct?
| 
| Yes.  With the pattern above, _my_ .c file is 2,648,183 bytes and the .o 
| file is 1,643,852 bytes.
| 
| Cheers!
| 
| David

-- 
Junior
jxz at uol.com.br 
http://jxz.dontexist.org/