lexer, tokens, and content-types

David Relson relson at osagesoftware.com
Mon Dec 9 13:26:27 CET 2002


At 06:48 AM 12/9/02, Matthias Andree wrote:

>On Sun, 08 Dec 2002, Gyepi SAM wrote:
>
> > On Sun, Dec 08, 2002 at 08:19:54PM -0500, David Relson wrote:
> >
> > > There is a lot there, but eps isn't complete.  If we want
> > > completeness, we need to extend their package (perhaps a whole
> > > lot) or use another package.
> >
> > No, it is not complete, and I had expected to write new code for it.
> > Yes, we have to write glue code. It is just a library after all ;)
> >
> > I encourage you to suggest alternatives. This is the time for it.
>
>Either find or rewrite a stripped-down alternative. We don't need all
>that stuff that eps declares; we merely need to recognize multiparts and
>messages (for nesting), determine the encoding -- there are the '7bit',
>'8bit', and 'binary' identity encodings and the quoted-printable and
>base64 encodings -- and decode it.
>
>Decoding qp is trivial. We can reuse existing code to decode base64. I
>wonder if we can find GOOD code to break apart MIME, because we
>certainly must not decode anything unless there's a MIME-Version header,
>and we must heed the boundary lines to prevent running over a boundary
>and decoding the next MIME part, which is at its defaults, say,
>7bit-text/plain-charset="US-ASCII".
>
>Then we must decide on a canonical character set, which is where UTF-8
>is attractive now because it still relies on the "unsigned char" type,
>but uses a variable number of "char" to represent one character; it
>transparently maps US-ASCII, and you know that the 2nd through last
>octets of a UTF-8 character always have their high bit set. It'll
>confuse our MAXTOKENLEN though, because UTF-8 might use up to 6 octets
>for a single character, and that's going to be a little harder.
>
>OTOH, I see no reference to "wchar" in flex.info, so I'm unsure whether
>we could feed anything other than chars to flex; we might then have to
>replace flex.
>
>If I were doing a MIME parser in flex, I'd have two alternatives:
>
>1. eat up the lines that contain the boundary line, charset, etc. (so,
>    e.g., the charset name gets lost as a token), or
>
>2. use the REJECT rule to push the lines I extracted information from
>    back to the input and have the other lexer rules deal with them again.
>    It is documented as slowing the whole lexer down, but I don't know by
>    how much.
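[Alternative 2 might look like the flex fragment below. This is a sketch only: the patterns and the `record_content_type()` / `got_token()` hooks are hypothetical. Strictly speaking, REJECT does not push the text back to the input; it tells flex to proceed to the next-best rule that matched, which is the documented source of the slowdown (flex must keep backtracking state for every rule once REJECT appears anywhere).]

```lex
%%
    /* First, grab whole header lines we want information from... */
^"Content-Type:".*    { record_content_type(yytext); /* hypothetical hook */
                        REJECT; /* ...then fall through to the next-best
                                   matching rule so its text still yields
                                   ordinary tokens */ }
[A-Za-z0-9]+          { got_token(yytext); /* hypothetical hook */ }
.|\n                  ;   /* ignore everything else */
%%
```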
>
>--
>Matthias Andree
>
>---------------------------------------------------------------------
>FAQ: http://bogofilter.sourceforge.net/bogofilter-faq.html
>To unsubscribe, e-mail: bogofilter-dev-unsubscribe at aotto.com
>For summary digest subscription: bogofilter-dev-digest-subscribe at aotto.com
>For more commands, e-mail: bogofilter-dev-help at aotto.com
