multiple lexers. (was [cvs] bogofilter mime.c,1.1.2.3,1.1.2.4 mime.h,1.1.2.1,1.1.2.2)

Mon Dec 30 05:26:36 CET 2002

On Mon, Dec 30, 2002 at 04:28:19AM +0100, Matthias Andree wrote:

Yes. If we are going to do MIME, we need to handle
all valid MIME constructs
(and as many invalid ones as possible).

> I believe the decoding belongs after yylex(), and for that purpose, we
> need two lexers. One lexer (L1) that understands mime, decodes and
> suppresses non-text/* MIME parts (maybe lets message/rfc822 through
> though, not currently implemented), possibly feed stuff through recode
> or iconv, and one lexer (L2) (our traditional) to tokenize, which need
> not know anything about MIME. If we want to treat HTML, we need another
> one (L3) that strikes before L2 and kills comments and white-on-white or
> other low-contrast text to avoid tokenizing invisible sections that
> cheat Bayesian filters.

This is a good idea. At the very least, it will simplify and focus
the various pieces of code.
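
A minimal sketch of the layering described above, assuming each lexer can
be modelled as a buffer-to-buffer stage. The names stage_fn,
strip_html_comments, tokenize and run_pipeline are hypothetical and are
not taken from bogofilter; the MIME stage (L1) is left out, and only
simplified stand-ins for L3 and L2 are shown.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 1024 };   /* arbitrary working size for this sketch */

/* Hypothetical stage type: reads NUL-terminated `in`, writes a
 * NUL-terminated result to `out` (capacity `cap`), returns its length. */
typedef size_t (*stage_fn)(const char *in, char *out, size_t cap);

/* L3 stand-in: drop <!-- ... --> so hidden text is never tokenized. */
size_t strip_html_comments(const char *in, char *out, size_t cap)
{
    size_t n = 0;

    assert(cap > 0);
    while (*in != '\0' && n + 1 < cap) {
        if (strncmp(in, "<!--", 4) == 0) {
            const char *end = strstr(in + 4, "-->");
            if (end == NULL)
                break;              /* unterminated comment: drop the rest */
            in = end + 3;
            continue;
        }
        out[n++] = *in++;
    }
    out[n] = '\0';
    return n;
}

/* L2 stand-in: one lowercase alphanumeric token per line.
 * It knows nothing about MIME or HTML. */
size_t tokenize(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    int in_token = 0;

    assert(cap > 0);
    for (; *in != '\0' && n + 2 < cap; in++) {
        unsigned char c = (unsigned char)*in;
        if (isalnum(c)) {
            out[n++] = (char)tolower(c);
            in_token = 1;
        } else if (in_token) {
            out[n++] = '\n';
            in_token = 0;
        }
    }
    if (in_token)
        out[n++] = '\n';
    out[n] = '\0';
    return n;
}

/* Feed the output of each stage into the next one, in order. */
size_t run_pipeline(const stage_fn *stages, size_t nstages,
                    const char *in, char *out, size_t cap)
{
    char cur[BUF_SIZE], next[BUF_SIZE];
    size_t len, i;

    assert(cap > 0);
    len = strlen(in);
    if (len >= sizeof cur)
        len = sizeof cur - 1;       /* sketch only: truncate long input */
    memcpy(cur, in, len);
    cur[len] = '\0';

    for (i = 0; i < nstages; i++) {
        len = stages[i](cur, next, sizeof next);
        memcpy(cur, next, len + 1); /* include the terminating NUL */
    }
    if (len >= cap)
        len = cap - 1;
    memcpy(out, cur, len);
    out[len] = '\0';
    return len;
}
```

With the stage list { strip_html_comments, tokenize }, the commented-out
text never reaches the tokenizer, and a MIME-decoding stage could be put
in front without touching either of the other two.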

> 
> BTW: it's not exactly helpful to mix parsing boundary= parameters and
> --BOUNDARY treatment in the same function, it makes the API ugly. The
> boundary= treatment belongs into the Content-Type: parser, 

I have actually changed this in my private copy...
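
For illustration only, one way such a split might look. This is not the
code from mime.c or from the private copy mentioned above; the names
parse_boundary_param, classify_line and line_kind are hypothetical. The
first function belongs to the Content-Type: parser, the second only
compares body lines against a boundary that has already been extracted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical Content-Type helper: copy the value of the boundary=
 * parameter into `out`. Returns 1 on success, 0 if the parameter is
 * missing, empty, or does not fit in `cap` bytes. */
int parse_boundary_param(const char *ctype, char *out, size_t cap)
{
    static const char key[] = "boundary=";
    const size_t klen = sizeof key - 1;
    const char *p;
    size_t n = 0;

    assert(cap > 0);
    for (p = ctype; *p != '\0'; p++) {
        size_t i = 0;
        /* parameter names are case-insensitive */
        while (i < klen && tolower((unsigned char)p[i]) == key[i])
            i++;
        /* require a separator before the name, so "xboundary=" is skipped */
        if (i == klen && (p == ctype || p[-1] == ';' ||
                          isspace((unsigned char)p[-1])))
            break;
    }
    if (*p == '\0')
        return 0;
    p += klen;

    if (*p == '"') {                /* quoted value: read up to closing quote */
        for (p++; *p != '\0' && *p != '"'; p++) {
            if (n + 1 >= cap)
                return 0;
            out[n++] = *p;
        }
    } else {                        /* bare value: ends at ';' or whitespace */
        for (; *p != '\0' && *p != ';' && !isspace((unsigned char)*p); p++) {
            if (n + 1 >= cap)
                return 0;
            out[n++] = *p;
        }
    }
    out[n] = '\0';
    return n > 0;
}

typedef enum { LINE_BODY, LINE_PART_START, LINE_MULTIPART_END } line_kind;

/* Hypothetical body-line helper: "--boundary" starts a part,
 * "--boundary--" closes the multipart, anything else is body text. */
line_kind classify_line(const char *line, const char *boundary)
{
    size_t blen = strlen(boundary);
    line_kind kind = LINE_PART_START;
    const char *rest;

    if (blen == 0 || line[0] != '-' || line[1] != '-' ||
        strncmp(line + 2, boundary, blen) != 0)
        return LINE_BODY;

    rest = line + 2 + blen;
    if (rest[0] == '-' && rest[1] == '-') {
        kind = LINE_MULTIPART_END;
        rest += 2;
    }
    /* only trailing whitespace or a line ending may follow */
    while (*rest == ' ' || *rest == '\t' || *rest == '\r' || *rest == '\n')
        rest++;
    return *rest == '\0' ? kind : LINE_BODY;
}
```

Keeping the two apart means the line matcher never has to look at header
syntax, and the header parser never has to look at body lines.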

> regretfully,
> I didn't manage to finish the code before Christmas.

Yes, I too started a parser, but did not get far enough with it.
I was aiming for a general-purpose MIME library, and will probably
complete it at some point.
Thankfully, David started one.

> Opinions?

Yes, thanks.

-Gyepi