Encoded headers parser
Junior
jxz at uol.com.br
Fri Jul 25 18:45:18 CEST 2003
On Fri, Jul 25, 2003 at 12:21:48AM -0400, David Relson wrote:
| Junior,
|
| Thanks for reporting the problem. As you probably realize, parsing encoded
| headers is new and the proper lexer pattern is still being worked on. My
| latest effort (below) handles your test case. I've also tested it with
| approx 100 encoded subject lines that I found in my email archives and
| don't see anything bad (except for the decoding of iso-2022-jp). Give it a
| try and let me know what problems you encounter.
|
| --- lexer_v3.l 24 Jul 2003 20:48:51 -0000 1.40
| +++ lexer_v3.l 25 Jul 2003 03:58:19 -0000
| @@ -126,8 +126,7 @@
|
| TOKEN {TOKENFRONT}{TOKENMID}{TOKENBACK}{1,70}
|
| -B64_OR_QP [0-9a-zA-Z\-\+\/\=_:]+
| -ENCODED_TOKEN {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?{B64_OR_QP}\?\=
| +ENCODED_TOKEN {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?[^[:blank:]]+\?\=
|
| DOLLARS [0-9]+
| CENTS [0-9]+
|
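For reference, the new ENCODED_TOKEN rule above corresponds roughly to this regular expression (a Python sketch; the {BOGOLEX_TOKEN}{0,60} prefix is omitted, [A-Za-z0-9-]+ is a simplified stand-in for {ID}, and [bqBQ] is an assumption to cover caseless matching):

```python
import re

# Rough Python sketch of the new ENCODED_TOKEN flex rule.
# [A-Za-z0-9-]+ stands in for {ID}; bogofilter's real definition may differ.
# [^ \t] mirrors flex's [^[:blank:]] (space and tab only).
ENCODED_TOKEN = re.compile(r'=\?[A-Za-z0-9-]+\?[bqBQ]\?[^ \t]+\?=')

m = ENCODED_TOKEN.search('Subject: =?iso-8859-1?Q?CAP=CDTULO_11.doc?=')
print(m.group(0))  # the whole encoded word is matched as one token
```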
While trying to register some mboxes, bogofilter keeps aborting with:
Invalid buffer size, exiting.
Aborted
Here is the chunk of content that triggers it:
[cut]
------=_NextPart_000_0005_01C27A09.43DD34E0
Content-Type: application/msword;
name="=?iso-8859-1?Q?CAP=CDTULO_11.doc?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="=?iso-8859-1?Q?CAP=CDTULO_11.doc?="
0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAACAAAAzgAAAAAAAAAA
EAAA0AAAAAEAAAD+////AAAAAMwAAADNAAAA////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
[/cut]
Using bogolexer, the last tokens seen were:
Content-Type
application
msword
CAPÍTULO
doc
Content-Transfer-Encoding
base64
Content-Disposition
attachment
I suspect bogofilter is scanning past the final ?= boundary and trying
to decode all the MIME data (which is big), and then breaks. Below is my
change to the lexer, which makes it work correctly:
ENCODED_TOKEN {BOGOLEX_TOKEN}{0,60}=\?{ID}\?[bq]\?[^[:blank:]\r\n]+\?\=
I added \r and \n to prevent this from happening. I know it's a
temporary remedy until something more robust comes. With the patch,
bogolexer sees:
Content-Type
application
msword
CAPÍTULO
doc
Content-Transfer-Encoding
base64
Content-Disposition
attachment
CAPÍTULO
doc
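The mechanism can be illustrated in Python (a simplified sketch: the actual crash involves flex's internal input buffering, but the newline issue is the same; [^ \t] mirrors flex's [^[:blank:]], and the two adjacent encoded words imitate the name=/filename= parameters above):

```python
import re

# Two encoded words on adjacent lines, as in the name=/filename= headers.
chunk = (
    'name="=?iso-8859-1?Q?CAP=CDTULO_11.doc?="\n'
    'filename="=?iso-8859-1?Q?CAP=CDTULO_11.doc?="\n'
)

# [^ \t] (flex [^[:blank:]]) does not exclude newlines, so the greedy
# match runs across the line break to the last ?= it can find.
broken = re.compile(r'=\?[A-Za-z0-9-]+\?[bqBQ]\?[^ \t]+\?=')
# Excluding \r and \n as well stops the match at the end of the line.
fixed = re.compile(r'=\?[A-Za-z0-9-]+\?[bqBQ]\?[^ \t\r\n]+\?=')

print('\n' in broken.search(chunk).group(0))  # True: spans both lines
print(fixed.search(chunk).group(0))           # just the first encoded word
```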
This chunk alone won't break it, but sorry, I can't send the full
emails, because they contain proprietary material. It should be easy to
reproduce, though: just have someone send an Outlook mail with a large
attachment whose filename contains accented characters.
Anyway, this type of decoding would be better restricted to the headers,
because the encoded fields are mostly From, Sender, Reply-To, and
Subject (fields that carry 8-bit characters, such as the ones containing
names). That should speed up the parsing, and these encodings are rarely
found in the body (or am I wrong?).
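For anyone who wants to check what these header tokens decode to, Python's standard library handles RFC 2047 encoded words; this is only an illustration of the expected result, since bogofilter itself is C and has its own decoder:

```python
from email.header import decode_header

# The encoded word from the Content-Disposition header above.
raw = '=?iso-8859-1?Q?CAP=CDTULO_11.doc?='

# decode_header returns (bytes, charset) pairs for each encoded word.
decoded = ''.join(
    part.decode(charset or 'ascii') if isinstance(part, bytes) else part
    for part, charset in decode_header(raw)
)
print(decoded)  # CAPÍTULO 11.doc  (Q-encoding maps '_' to space)
```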
After running flex, I got a 3 MB lexer_v3.c; is that correct?
Now I will try to re-register the emails and test the decoding a bit
more.
| >Another question: will bogofilter ever support storing double tokens
| >(phrases)? It would improve accuracy; do you avoid it because of the
| >performance cost?
|
| Current thought is to get bogofilter doing the best possible job with
| single tokens. After that we can work on phrases.
|
--
Junior
jxz at uol.com.br
http://jxz.dontexist.org/
More information about the Bogofilter mailing list