Encoded headers parser

Junior jxz at uol.com.br
Fri Jul 25 18:45:18 CEST 2003


On Fri, Jul 25, 2003 at 12:21:48AM -0400, David Relson wrote:
| Junior,
| 
| Thanks for reporting the problem.  As you probably realize, parsing encoded 
| headers is new and the proper lexer pattern is still being worked on.  My 
| latest effort (below) handles your test case.  I've also tested it with 
| approx 100 encoded subject lines that I found in my email archives and 
| don't see anything bad (except for the decoding of iso-2022-jp).  Give it a 
| try and let me know what problems you encounter.
| 
| --- lexer_v3.l  24 Jul 2003 20:48:51 -0000      1.40
| +++ lexer_v3.l  25 Jul 2003 03:58:19 -0000
| @@ -126,8 +126,7 @@
| 
|  TOKEN          {TOKENFRONT}{TOKENMID}{TOKENBACK}{1,70}
| 
| -B64_OR_QP      [0-9a-zA-Z\-\+\/\=_:]+
| -ENCODED_TOKEN  {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?{B64_OR_QP}\?\=
| +ENCODED_TOKEN  {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?[^[:blank:]]+\?\=
| 
|  DOLLARS                [0-9]+
|  CENTS          [0-9]+
| 

Trying to register some mboxes, bogofilter keeps breaking with:

	Invalid buffer size, exiting.
	Aborted

Here, the chunk of content:

	[cut]
	------=_NextPart_000_0005_01C27A09.43DD34E0
	Content-Type: application/msword;
		name="=?iso-8859-1?Q?CAP=CDTULO_11.doc?="
	Content-Transfer-Encoding: base64
	Content-Disposition: attachment;
		filename="=?iso-8859-1?Q?CAP=CDTULO_11.doc?="

	0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAACAAAAzgAAAAAAAAAA
	EAAA0AAAAAEAAAD+////AAAAAMwAAADNAAAA////////////////////////////////////////
	////////////////////////////////////////////////////////////////////////////
	[/cut]

Using bogolexer, the last tokens seen were:

	Content-Type
	application
	msword
	CAPÍTULO
	doc
	Content-Transfer-Encoding
	base64
	Content-Disposition
	attachment

I suspect bogofilter is matching past the final ?= boundary and trying
to decode all the MIME content (which is big), and breaks. Below is my
change to the lexer, which makes it work OK:

	ENCODED_TOKEN  {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?[^[:blank:]\r\n]+\?\=

I added \r and \n to prevent this from happening. I know it's a
temporary remedy until something more robust comes along. With the
change, bogolexer now produces:

	Content-Type
	application
	msword
	CAPÍTULO
	doc
	Content-Transfer-Encoding
	base64
	Content-Disposition
	attachment
	CAPÍTULO
	doc
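
In case anyone wants to poke at the pattern in isolation, here is a
minimal standalone lexer that exercises the fixed rule. It's only a
sketch: {ID} is simplified to a bare charset name and the
{BOGOLEX_TOKEN}{0,60} prefix is dropped, so it is not the real
lexer_v3.l context.

	%{
	/* Minimal sketch to exercise the ENCODED_TOKEN pattern alone.
	 * Simplifications vs. lexer_v3.l: ID is reduced to a charset
	 * name and the {BOGOLEX_TOKEN}{0,60} prefix is omitted.
	 * Build: flex sketch.l && cc lex.yy.c -o sketch
	 */
	#include <stdio.h>
	%}

	%option noyywrap caseless

	ID             [a-zA-Z0-9-]+
	ENCODED_TOKEN  =\?{ID}\?[bq]\?[^[:blank:]\r\n]+\?\=

	%%
	{ENCODED_TOKEN}   { printf("encoded: %s\n", yytext); }
	.|\n              { /* discard everything else */ }
	%%

	int main(void)
	{
	    yylex();
	    return 0;
	}

Feeding it the Content-Disposition line above prints just the
=?iso-8859-1?Q?CAP=CDTULO_11.doc?= token; with \r and \n absent from
the excluded class, the greedy + is free to scan across line breaks
into the base64 body while looking for a closing ?=, which is
presumably where the oversized buffer comes from.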

This chunk alone won't reproduce the break, and sorry, I can't send
the emails, because they contain proprietary material. But it should
be easy to reproduce: just have someone send an Outlook mail with a
(big) attachment whose filename contains accented characters.

Anyway, this type of decoding would be nicer only in the headers,
because the encoded fields are almost always From, Sender, Reply-To,
and Subject (fields that carry 8-bit characters, like the ones that
contain names, etc.). It should speed up the parsing, and these
encodings are rarely found in the body (or am I wrong?).
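
As a reference point, the Q decoding itself is cheap once the encoded
word has been isolated. The sketch below shows what decoding the
fragment between the third and fourth '?' amounts to; decode_q is a
hypothetical helper for illustration, not bogofilter's actual decoder.

	#include <ctype.h>
	#include <stdio.h>

	/* Rough sketch of RFC 2047 "Q" decoding, i.e. the fragment
	 * between the third and fourth '?' of an encoded word: '_'
	 * stands for a space and '=XX' is a hex-encoded byte.
	 * decode_q is a hypothetical helper, not bogofilter's code. */
	static void decode_q(const char *in)
	{
	    while (*in) {
	        if (*in == '_') {
	            putchar(' ');
	            in++;
	        } else if (*in == '=' && isxdigit((unsigned char)in[1])
	                              && isxdigit((unsigned char)in[2])) {
	            unsigned int byte;
	            sscanf(in + 1, "%2x", &byte);
	            putchar((int)byte);
	            in += 3;
	        } else {
	            putchar(*in++);
	        }
	    }
	    putchar('\n');
	}

	int main(void)
	{
	    /* The encoded filename from the chunk above: */
	    decode_q("CAP=CDTULO_11.doc"); /* "CAPÍTULO 11.doc", Latin-1 */
	    return 0;
	}

Doing this only on header lines, as suggested, would leave the body
scan untouched.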

After running flex, I got a 3 MB lexer_v3.c; is that correct?

Now I will try to re-register the emails and test the decoding a bit
more.

| >Another question: will bogofilter ever support double tokens store
| >(phrases)? It would improve accuracy, but you don't do it because the
| >performance issue?
| 
| Current thought is to get bogofilter doing the best possible job with 
| single tokens.  After that we can work on phrases.
| 

-- 
Junior
jxz at uol.com.br 
http://jxz.dontexist.org/

