lexer, tokens, and content-types
David Relson
relson at osagesoftware.com
Mon Dec 9 13:38:36 CET 2002
At 06:48 AM 12/9/02, Matthias Andree wrote:
>On Sun, 08 Dec 2002, Gyepi SAM wrote:
>
> > On Sun, Dec 08, 2002 at 08:19:54PM -0500, David Relson wrote:
> >
> > > There is a lot there, but eps isn't complete. If we want completeness, we
> > > need to extend their package (perhaps a whole lot) or use another package.
> >
> > No it is not complete and I had expected to write new code for it.
> > Yes, we have to write glue code. It is just a library afterall ;)
> >
> > I encourage you to suggest alternatives. This is the time for it.
>
>Either find or rewrite a stripped-down alternative. We don't need all
>that stuff that eps declares, we merely need to figure multiparts and
>messages (for nesting), figure the encoding -- there are the '7bit',
>'8bit', 'binary' identity encodings and the quoted-printable and base64
>encodings -- and decoding them.
>
>Decoding qp is trivial. We can reuse existing code to decode base64. I
>wonder if we find GOOD code to break apart MIME, because we for sure
>must not decode anything unless there's a MIME-Version header, and we
>must heed the boundary lines to prevent running over boundaries and
>decoding the next MIME part which is at its defaults, say,
>7bit-text/plain-charset="US-ASCII".
Like you say, qp is trivial. Base64 and uuencode are easy. MIME is the
difficult one. You've mentioned libgmime. As I haven't researched it, I'll
ask the simple questions: is it complete? robust? easy to use?
>Then we must decide on a canonical character set, which is where UTF-8
>is attractive now because it still relies on the "unsigned char" type,
>but uses a variable amount of "char" to represent one character; it
>transparently maps US-ASCII and you know that the 2nd through last octets
>of a UTF-8 character always have their high bit set. It'll confuse our
>MAXTOKENLEN though because UTF-8 might print 6 octets for a single
>character, and that's going to be a little harder.
OK, UTF-8 is transparent for US-ASCII. How compact is it for the characters
of European languages? If it handles those well, I think it'd work for
bogofilter. I may be taking a parochial view here, but it doesn't bother
me if Asian languages have characters that convert to 6 octets and have
tokens that run afoul of MAXTOKENLEN.
>OTOH, I see no reference to "wchar" in flex.info, so I'm unsure if we
>could feed something else than chars to flex, so we might then have to
>replace flex.
>
>When I'd be doing a MIME parser in flex, I'd have two alternatives:
>
>1. eat up the lines that contain the boundary line, charset etc. (so the
> charset name e. g. gets lost as token), or
>
>2. use the REJECT rule to push the lines I extracted information from
> back to the input and have the other lexer rules deal with it again.
> It is documented as slowing the whole lexer down, but I don't know by
> which amount.
Solution 1 is pretty workable. In the current get_token() there's an
if (IPADDR) before the while {} loop and a number of if (class == TOKEN_TYPE)
statements inside the loop. I've had versions of the function that use
switch () {} statements instead of the conditionals. By using a variable,
e.g. save_class, to remember the token class, this gives a simple state
machine that _we_ totally control. For example, I used this mechanism at one
point to let flex recognize complete From lines, i.e. "^From\ .*{DATE}$",
and then to reparse the line for the individual tokens.
Of course, I don't yet know about the REJECT mechanism, so there might be a
simpler, better way to skin that cat. There probably is :-)
More information about the bogofilter-dev mailing list