Encoded headers parser

Fri Jul 25 21:59:13 CEST 2003

Junior,

Currently, bogofilter has the following in lexer_v3.l:

BASE64          [0-9a-zA-Z/+=]+
QP              [^[:blank:]]+
ENCODED_TOKEN   {BOGOLEX_TOKEN}*=\?{ID}\?(b\?{BASE64}|q\?{QP})\?\=

It's more specific than what you're running and should do better.  Let me 
know how it goes.
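
For anyone who wants to experiment without rebuilding the lexer, here is a rough Python approximation of the pattern above. It is only a sketch: the ID class is a stand-in (the real ID definition is not shown in this message), the leading {BOGOLEX_TOKEN}* prefix is left out, case-insensitive matching is an assumption, and decode_encoded_words is a hypothetical helper name, not part of bogofilter.

```python
import base64
import quopri
import re

# Assumption: the real ID definition is not shown in the message, so this
# character class is only a stand-in for the charset name.
ID = r"[0-9a-zA-Z_.-]+"
BASE64 = r"[0-9a-zA-Z/+=]+"
QP = r"[^ \t]+"  # Python spelling of flex's [^[:blank:]]+

# Approximates ENCODED_TOKEN without the leading {BOGOLEX_TOKEN}* prefix.
# Case-insensitive matching is an assumption of this sketch.
ENCODED_WORD = re.compile(
    rf"=\?({ID})\?(?:b\?({BASE64})|q\?({QP}))\?=",
    re.IGNORECASE,
)


def decode_encoded_words(header_value):
    """Replace each encoded word in header_value with its decoded text."""

    def _decode(match):
        charset, b64_text, qp_text = match.groups()
        try:
            if b64_text is not None:
                raw = base64.b64decode(b64_text)
            else:
                # header=True turns "_" into a space, as in header encoding.
                raw = quopri.decodestring(qp_text.encode("ascii"), header=True)
            return raw.decode(charset, errors="replace")
        except (ValueError, LookupError):
            # Bad padding, non-ASCII input, or unknown charset:
            # leave the token exactly as it was.
            return match.group(0)

    return ENCODED_WORD.sub(_decode, header_value)
```

Feeding it a header line such as "Subject: =?iso-8859-1?q?relat=F3rio_final.doc?=" should give back the line with the accented filename in readable form; text without encoded words passes through unchanged.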

At 12:45 PM 7/25/03, Junior wrote:

...[snip]...

>Only this chunk won't break. Sorry, but I can't send the emails because
>they contain proprietary material. It should be easy to reproduce, though:
>just ask someone to send an Outlook mail with a (big) attachment whose
>filename contains an accented character.

I don't have ready access to Outlook, so I'll await your test results.

>Anyway, this type of decoding would be better done only in the headers,
>because the encoded fields are almost always From, Sender, Reply-To, and
>Subject (fields that carry 8-bit characters, such as those that contain
>names, etc.). That should speed up the parsing, and these encodings are
>rarely found in the body (or am I wrong?).

Using <INITIAL> in the patterns restricts it to headers.
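
To picture what that restriction does, here is a small Python model of a start condition. It is a sketch under assumptions, not how the flex scanner is implemented: the "BODY" state name, the switch at the first blank line, the simplified encoded-word regex, and the find_header_encoded_words helper are all made up for illustration.

```python
import re

# Simplified stand-in for the encoded-token pattern (an assumption of this
# sketch, not the exact flex definition).
ENCODED_WORD = re.compile(
    r"=\?[0-9a-zA-Z_.-]+\?(?:b\?[0-9a-zA-Z/+=]+|q\?[^ \t]+)\?=",
    re.IGNORECASE,
)


def find_header_encoded_words(message_text):
    """Collect encoded words, but only while scanning the header block."""
    state = "INITIAL"  # headers; "BODY" is a hypothetical second state
    found = []
    for line in message_text.splitlines():
        if state == "INITIAL":
            if line == "":
                # First blank line ends the headers; the pattern is no
                # longer tried after this point.
                state = "BODY"
                continue
            found.extend(m.group(0) for m in ENCODED_WORD.finditer(line))
    return found
```

An encoded word that shows up in the body is simply never looked at, which is where the hoped-for parsing speedup would come from.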

>After running flex, I got a 3 MB lexer_v3.c. Is that correct?

Yes.  With the pattern above, _my_ .c file is 2,648,183 bytes and the .o 
file is 1,643,852 bytes.

Cheers!

David