lexer, tokens, and content-types

Matthias Andree matthias.andree at gmx.de
Mon Dec 9 12:48:06 CET 2002


On Sun, 08 Dec 2002, Gyepi SAM wrote:

> On Sun, Dec 08, 2002 at 08:19:54PM -0500, David Relson wrote:
> 
> > There is a lot there, but eps isn't complete.  If we want completeness, we 
> > need to extend their package (perhaps a whole lot) or use another package.
> 
> No, it is not complete, and I had expected to write new code for it.
> Yes, we have to write glue code. It is just a library after all ;)
> 
> I encourage you to suggest alternatives. This is the time for it.

Either find or write a stripped-down alternative. We don't need all the
stuff that eps declares; we merely need to recognize multipart and
message parts (for nesting), determine the encoding -- there are the
'7bit', '8bit' and 'binary' identity encodings plus the quoted-printable
and base64 encodings -- and decode it.
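
Classifying the Content-Transfer-Encoding value is simple enough; a
rough sketch (untested, the names are mine, strcasecmp() is POSIX):

    #include <strings.h>    /* strcasecmp() */

    /* identity encodings need no decoding, only qp and base64 do */
    enum cte { CTE_IDENTITY, CTE_QP, CTE_BASE64, CTE_UNKNOWN };

    static enum cte classify_cte(const char *value)
    {
        if (strcasecmp(value, "7bit") == 0 ||
            strcasecmp(value, "8bit") == 0 ||
            strcasecmp(value, "binary") == 0)
            return CTE_IDENTITY;
        if (strcasecmp(value, "quoted-printable") == 0)
            return CTE_QP;
        if (strcasecmp(value, "base64") == 0)
            return CTE_BASE64;
        return CTE_UNKNOWN;
    }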

Decoding qp is trivial. We can reuse existing code to decode base64. I
wonder whether we can find GOOD code to break apart MIME, because we
certainly must not decode anything unless there is a MIME-Version
header, and we must heed the boundary lines to prevent running past a
boundary and decoding the next MIME part, which may be at its defaults:
7bit, text/plain, charset="US-ASCII".
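
To show how trivial qp is, a rough in-place decoder (untested sketch,
the names are mine): "=XY" becomes the octet 0xXY, and "=" right before
a line break is a soft break that is dropped; everything else passes
through unchanged, so the output never grows:

    #include <stddef.h>

    static int hexval(int c)       /* value of a hex digit, or -1 */
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10; /* be liberal */
        return -1;
    }

    /* decode quoted-printable; out may alias in; returns octets written */
    static size_t qp_decode(const char *in, size_t len, char *out)
    {
        size_t i, o = 0;

        for (i = 0; i < len; i++) {
            if (in[i] == '=' && i + 1 < len) {
                /* soft line break: "=" followed by LF or CRLF */
                if (in[i+1] == '\n') { i += 1; continue; }
                if (in[i+1] == '\r' && i + 2 < len && in[i+2] == '\n') {
                    i += 2; continue;
                }
                if (i + 2 < len) {
                    int hi = hexval((unsigned char)in[i+1]);
                    int lo = hexval((unsigned char)in[i+2]);
                    if (hi >= 0 && lo >= 0) {
                        out[o++] = (char)(hi * 16 + lo);
                        i += 2;
                        continue;
                    }
                }
            }
            out[o++] = in[i];      /* literal octet */
        }
        return o;
    }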

Then we must decide on a canonical character set, and that is where
UTF-8 is attractive: it still relies on the "unsigned char" type, but
uses a variable number of "char"s to represent one character; it maps
US-ASCII transparently, and the 2nd through last octets of a multi-octet
UTF-8 character always have their high bit set. It will confuse our
MAXTOKENLEN though, because UTF-8 may use up to 6 octets for a single
character, so enforcing that limit in characters is a little harder.
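
Counting characters instead of octets is cheap, though, because the
continuation octets are recognizable by their top two bits (10xxxxxx);
a rough sketch (untested, the name is mine):

    #include <stddef.h>

    /* number of UTF-8 characters in a buffer of octets: count every
     * octet that is not a continuation octet (10xxxxxx) */
    static size_t utf8_charcount(const unsigned char *s, size_t octets)
    {
        size_t i, chars = 0;

        for (i = 0; i < octets; i++)
            if ((s[i] & 0xC0) != 0x80)
                chars++;
        return chars;
    }

MAXTOKENLEN could then be checked against utf8_charcount() instead of
the raw octet count.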

OTOH, I see no reference to "wchar" in flex.info, so I'm unsure whether
we can feed anything other than chars to flex; we might then have to
replace flex.

If I were writing a MIME parser in flex, I'd see two alternatives:

1. eat up the lines that contain the boundary, the charset, etc. (so
   that the charset name, for example, gets lost as a token), or

2. use REJECT so that a rule which has extracted information from a line
   hands the same text back for the other lexer rules to deal with again
   (roughly as sketched below). REJECT is documented as slowing the
   whole lexer down, but I don't know by how much.
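
For alternative 2, a rough flex sketch (untested; the helper functions
are made up, they don't exist in bogofilter, and note that using REJECT
anywhere imposes its cost on every rule in the scanner):

    %{
    /* hypothetical helpers, these do not exist in bogofilter */
    void remember_charset(const char *line); /* record charset= value  */
    void got_token(const char *tok);         /* register a lexer token */
    %}
    %option noyywrap
    %%
    ^"Content-Type:".*"charset=".*\n  { remember_charset(yytext); REJECT; }
    [A-Za-z0-9]+                      { got_token(yytext); }
    .|\n                              { /* discard everything else */ }
    %%

After the REJECT, flex falls back to the next best match for the same
text, so the header line is tokenized by the ordinary rules as well.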

-- 
Matthias Andree



