[cvs] bogofilter mime.c,1.1.2.3,1.1.2.4 mime.h,1.1.2.1,1.1.2.2

Mon Dec 30 04:28:19 CET 2002

On Sun, 29 Dec 2002, m-a at users.sourceforge.net wrote:

> Update of /cvsroot/bogofilter/bogofilter
> In directory sc8-pr-cvs1:/tmp/cvs-serv9749
> 
> Modified Files:
>       Tag: mime
> 	mime.c mime.h 
> Log Message:
> Band-aid fix to reset encoding, type and header to defaults when a boundary line is encountered.

I call this a band-aid fix because it does roughly what we want, but it
isn't anywhere near correct. (It still does not support nested MIME,
such as a mail with an attachment embedded in a MIME bounce issued by
Sendmail or Postfix.)

Consider a mail of this structure:

| Mime-Version: 1.0
| Content-Type: multipart/mixed;
| 	boundary="=====================_816767349==_"
| 
| --=====================_816767349==_
| Content-Type: text/plain; charset="us-ascii"; format=flowed
| 
| text here
| 
| --=====================_816767349==_
| Content-Type: application/octet-stream; name="mime.1227.gz";
|  x-mac-type="477A6970"; x-mac-creator="477A6970"
| Content-Transfer-Encoding: base64
| Content-Disposition: attachment; filename="mime.1227.gz"
| 
| base64filehere
| ioC/vmsmAAA=
| --=====================_816767349==_
| Content-Type: text/plain; charset="us-ascii"; format=flowed
| 
| signature here
| 
| --=====================_816767349==_--

I saved one of these mails, and bogofilter has a hard time figuring out
the last part: it never found the boundary line and never reset the
encoding to 7bit, the default. Simply not decoding the boundary line is
somehow not sufficient, because I believe the mime.c state and the
lexer.l state are out of sync, so yyinput/yylex reads junk. No proof,
though.
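
A minimal sketch of one way to keep the boundary from getting lost:
compare each raw line against the active boundary before any base64 or
quoted-printable decoding, and reset the per-part state on a match. The
names below (classify_line, on_raw_line, struct part_state) are
hypothetical and are not taken from mime.c; the tolerance for trailing
whitespace follows the delimiter syntax in RFC 2046.

```c
#include <assert.h>
#include <string.h>

enum boundary_kind { NOT_BOUNDARY, PART_BOUNDARY, CLOSE_BOUNDARY };
enum encoding { ENC_7BIT, ENC_BASE64, ENC_QP };

/* Hypothetical per-part state; the real mime.c layout may differ. */
struct part_state {
    enum encoding enc;  /* transfer encoding of the current part */
    int in_header;      /* nonzero while reading part headers */
};

/* Classify a raw, undecoded line against the active boundary string. */
static enum boundary_kind classify_line(const char *line, const char *boundary)
{
    const char *rest;
    int closing = 0;

    assert(line != NULL && boundary != NULL);
    if (line[0] != '-' || line[1] != '-')
        return NOT_BOUNDARY;
    if (strncmp(line + 2, boundary, strlen(boundary)) != 0)
        return NOT_BOUNDARY;
    rest = line + 2 + strlen(boundary);
    if (rest[0] == '-' && rest[1] == '-') {
        closing = 1;
        rest += 2;
    }
    /* Only padding and the line terminator may follow. */
    while (*rest == ' ' || *rest == '\t' || *rest == '\r' || *rest == '\n')
        rest++;
    if (*rest != '\0')
        return NOT_BOUNDARY;
    return closing ? CLOSE_BOUNDARY : PART_BOUNDARY;
}

/* Check the raw line first; on a boundary, fall back to the defaults. */
static enum boundary_kind on_raw_line(struct part_state *st,
                                      const char *line, const char *boundary)
{
    enum boundary_kind k = classify_line(line, boundary);

    if (k != NOT_BOUNDARY) {
        st->enc = ENC_7BIT;                    /* RFC 2045 default */
        st->in_header = (k == PART_BOUNDARY);  /* new part headers follow */
    }
    return k;
}
```

With the sample mail above, the line following "ioC/vmsmAAA=" would be
classified as PART_BOUNDARY, so the encoding drops back to 7bit before
the signature part is read.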

I believe the decoding belongs after yylex(), and for that purpose we
need two lexers. One lexer (L1) understands MIME, decodes the parts and
suppresses non-text/* ones (maybe letting message/rfc822 through, though
that is not currently implemented), and possibly feeds the result
through recode or iconv. The other lexer (L2), our traditional one,
tokenizes and need not know anything about MIME. If we want to treat
HTML, we need a third one (L3) that runs before L2 and kills comments
and white-on-white or other low-contrast text, to avoid tokenizing
invisible sections that cheat Bayesian filters.
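
The L1 filtering decision could look roughly like the sketch below:
given a part's media type, decide whether its decoded body is handed on
to L2. The function names and the allow_rfc822 switch are assumptions
for illustration only; a missing Content-Type is treated as text/plain,
which is the RFC 2045 default.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive test whether s starts with prefix. */
static int has_prefix_ci(const char *s, const char *prefix)
{
    assert(s != NULL && prefix != NULL);
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;   /* also taken when s ends before prefix does */
        s++;
        prefix++;
    }
    return 1;
}

/* Hypothetical L1 decision: should this part's body reach the tokenizer?
 * media_type is the type/subtype from Content-Type, or NULL if the part
 * carried no such header. */
static int l1_pass_to_l2(const char *media_type, int allow_rfc822)
{
    if (media_type == NULL)
        return 1;                       /* defaults to text/plain */
    if (has_prefix_ci(media_type, "text/"))
        return 1;
    if (allow_rfc822 && has_prefix_ci(media_type, "message/rfc822"))
        return 1;
    return 0;                           /* suppress everything else */
}
```

For the sample mail, the two text/plain parts would be passed on and the
application/octet-stream attachment would be dropped before L2 sees it.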

BTW: it's not exactly helpful to mix parsing of boundary= parameters and
--BOUNDARY treatment in the same function; it makes the API ugly. The
boundary= treatment belongs in the Content-Type: parser. Regretfully,
I didn't manage to finish the code before Christmas.
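
As a sketch of that split, the Content-Type: parser could extract the
boundary= parameter on its own and hand the plain string to whatever
matches --BOUNDARY lines later. get_boundary_param is a hypothetical
name, not an existing bogofilter function, and the code deliberately
ignores backslash escapes in quoted strings and a ';' hidden inside
another quoted parameter.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Copy the boundary= parameter of a Content-Type value into out.
 * Returns 1 on success, 0 if the parameter is absent, empty, or does
 * not fit into out (outsz bytes including the terminating NUL). */
static int get_boundary_param(const char *value, char *out, size_t outsz)
{
    static const char name[] = "boundary";
    const char *p = value;
    size_t i, n;

    assert(value != NULL && out != NULL && outsz > 0);
    while (*p != '\0') {
        if (*p++ != ';')                /* parameters follow a ';' */
            continue;
        while (isspace((unsigned char)*p))
            p++;                        /* also skips header folding */
        for (i = 0; name[i] != '\0'; i++)
            if (tolower((unsigned char)p[i]) != name[i])
                break;
        if (name[i] != '\0' || p[i] != '=')
            continue;                   /* some other parameter */
        p += i + 1;
        if (*p == '"') {
            p++;
            n = strcspn(p, "\"");       /* up to the closing quote */
        } else {
            n = strcspn(p, "; \t\r\n"); /* bare token */
        }
        if (n == 0 || n >= outsz)
            return 0;
        memcpy(out, p, n);
        out[n] = '\0';
        return 1;
    }
    return 0;
}
```

The boundary-line matcher then only ever sees the extracted string and
does not need to know how Content-Type parameters are written.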

Opinions?

-- 
Matthias Andree