Encoded headers parser
David Relson
relson at osagesoftware.com
Fri Jul 25 21:59:13 CEST 2003
Junior,
Currently, bogofilter has the following in lexer_v3.l:
BASE64 [0-9a-zA-Z/+=]+
QP [^[:blank:]]+
ENCODED_TOKEN {BOGOLEX_TOKEN}*=\?{ID}\?(b\?{BASE64}|q\?{QP})\?\=
It's more specific than what you're running and should do better. Let me
know how it goes.
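
For anyone who wants to experiment outside of flex, here is a rough Python approximation of the pattern above. It is only a sketch: the BOGOLEX_TOKEN and ID definitions are not quoted in this message, so the stand-ins below are assumptions, and Python's backtracking matcher is not identical to flex's longest-match behaviour. Case-insensitive matching is also an assumption, made because RFC 2047 lets the encoding letter appear in either case.

```python
import re

# Assumed stand-ins: the real BOGOLEX_TOKEN and ID definitions live in
# lexer_v3.l and are not quoted in this message.
BOGOLEX_TOKEN = r"[^\s?=]"   # hypothetical: one ordinary token character
ID = r"[0-9A-Za-z-]+"        # hypothetical: charset name such as iso-8859-1

# These two mirror the definitions quoted above.
BASE64 = r"[0-9a-zA-Z/+=]+"
QP = r"[^ \t]+"              # flex [^[:blank:]]+ : anything but space or tab

ENCODED_TOKEN = re.compile(
    rf"(?:{BOGOLEX_TOKEN})*=\?{ID}\?(?:b\?{BASE64}|q\?{QP})\?=",
    re.IGNORECASE,           # assumption: accept B/Q in either case
)


def is_encoded_token(text):
    """Return True when the whole string looks like one encoded token."""
    return ENCODED_TOKEN.fullmatch(text) is not None
```

A quoted-printable word such as `=?iso-8859-1?q?Jos=E9?=` and a base64 word such as `=?UTF-8?B?SsO6bmlvcg==?=` are both accepted; a word containing a space is not, because neither BASE64 nor QP can cross a blank.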
At 12:45 PM 7/25/03, Junior wrote:
...[snip]...
>Only this chunk won't break, but sorry, I can't send the emails, because
>they contain proprietary material. But it would be easy to reproduce: just
>ask someone to send an Outlook mail with a (big) attachment whose filename
>has an accented char.
I don't have ready access to Outlook, so I'll await your test results.
>Anyway, this type of decoding would be better done only in the headers,
>because the encoded fields are almost always From, Sender, Reply-To, Subject
>(fields that carry 8-bit chars, like the ones that contain names,
>etc). It should speed up the parsing, and these encodings are rarely
>found in the body (or am I wrong?).
Using <INITIAL> in the patterns restricts it to headers.
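
The same idea can be sketched in Python for illustration. This is not bogofilter code: the function name and the simplified encoded-word regex are hypothetical, and the sketch simply treats everything before the first blank line as the header block, which plays the role that the INITIAL start condition plays in the lexer.

```python
import re

# Simplified encoded-word matcher (an assumption, not the lexer_v3.l pattern).
ENCODED_WORD = re.compile(
    r"=\?[0-9A-Za-z-]+\?[bq]\?[^\s?]+\?=",
    re.IGNORECASE,
)


def encoded_words_in_headers(message):
    """Return the encoded words found before the first blank line only."""
    # Normalize line endings so the header/body separator is always "\n\n".
    normalized = message.replace("\r\n", "\n")
    headers, _, _body = normalized.partition("\n\n")
    # The body is never scanned, much like a rule limited to <INITIAL>.
    return ENCODED_WORD.findall(headers)
```

With a message whose Subject header and body both contain an encoded word, only the one in the Subject header is returned.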
>After running flex, I got a 3MB lexer_v3.c, is that correct?
Yes. With the pattern above, _my_ .c file is 2,648,183 bytes and the .o
file is 1,643,852 bytes.
Cheers!
David