lexer, tokens, and content-types

David Relson relson at osagesoftware.com
Mon Dec 9 13:38:36 CET 2002


At 06:48 AM 12/9/02, Matthias Andree wrote:

>On Sun, 08 Dec 2002, Gyepi SAM wrote:
>
> > On Sun, Dec 08, 2002 at 08:19:54PM -0500, David Relson wrote:
> >
> > > There is a lot there, but eps isn't complete.  If we want
> > > completeness, we need to extend their package (perhaps a whole lot)
> > > or use another package.
> >
> > No, it is not complete, and I had expected to write new code for it.
> > Yes, we have to write glue code. It is just a library after all ;)
> >
> > I encourage you to suggest alternatives. This is the time for it.
>
>Either find or write a stripped-down alternative. We don't need all
>that stuff that eps declares; we merely need to figure out multiparts
>and messages (for nesting), figure out the encoding -- there are the
>'7bit', '8bit', and 'binary' identity encodings plus the quoted-printable
>and base64 encodings -- and decode them.
>
>Decoding qp is trivial. We can reuse existing code to decode base64. I
>wonder whether we can find GOOD code to break apart MIME, because we
>certainly must not decode anything unless there's a MIME-Version header,
>and we must heed the boundary lines to keep from running past a boundary
>and decoding the next MIME part, which may be at its defaults, say,
>7bit text/plain with charset="US-ASCII".

Like you say, qp is trivial.  Base64 and uuencode are easy.  MIME is the
difficult one.  You've mentioned libgmime.  As I haven't researched it,
I'll ask the simple questions: is it complete? robust? easy to use?
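
To put the "trivial" in concrete terms, decoding qp comes down to
something like the sketch below.  The names (hexval, qp_decode) are made
up for illustration; this isn't code from eps or from anything we already
have:

#include <ctype.h>
#include <stddef.h>

/* Value of a hex digit, or -1 if c is not a hex digit. */
static int hexval(unsigned char c)
{
    if (isdigit(c))  return c - '0';
    if (isxdigit(c)) return toupper(c) - 'A' + 10;
    return -1;
}

/* Decode a quoted-printable buffer in place and return the decoded
 * length.  "=XX" becomes the octet 0xXX, a soft line break ("=" at the
 * end of a line) is dropped, and anything else is copied through. */
static size_t qp_decode(unsigned char *buf, size_t len)
{
    unsigned char *out = buf;
    size_t i = 0;

    while (i < len) {
        if (buf[i] == '=') {
            if (i + 1 < len && buf[i + 1] == '\n') {          /* soft break */
                i += 2;
                continue;
            }
            if (i + 2 < len && buf[i + 1] == '\r' && buf[i + 2] == '\n') {
                i += 3;                                        /* soft break */
                continue;
            }
            if (i + 2 < len) {
                int hi = hexval(buf[i + 1]);
                int lo = hexval(buf[i + 2]);
                if (hi >= 0 && lo >= 0) {
                    *out++ = (unsigned char)(hi * 16 + lo);
                    i += 3;
                    continue;
                }
            }
        }
        *out++ = buf[i++];
    }
    return (size_t)(out - buf);
}

The only real wrinkles are the soft line breaks and malformed "=XY"
sequences, which the sketch simply passes through unchanged.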

>Then we must decide on a canonical character set, which is where UTF-8
>is attractive now because it still relies on the "unsigned char" type
>but uses a variable number of "char"s to represent one character; it
>transparently maps US-ASCII, and the 2nd through last octets of a
>multi-byte UTF-8 character always have their high bit set. It'll confuse
>our MAXTOKENLEN though, because UTF-8 might need up to 6 octets for a
>single character, and that's going to be a little harder.

OK, UTF-8 is transparent for US-ASCII.  How compact is it for characters
in European languages?  If it handles those well, I think it'd work for
bogofilter.  I may be taking a parochial view here, but it doesn't bother
me if Asian languages have characters that convert to 6 octets and
produce tokens that run afoul of MAXTOKENLEN.
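
For what it's worth, European text stays compact: US-ASCII remains 1
octet and the Latin-1 accented letters take 2, so ordinary
European-language tokens barely grow.  The sequence length can be read
straight off the lead byte -- a throwaway sketch, with utf8_seqlen being
a name I just made up:

/* Number of octets in the UTF-8 sequence introduced by lead byte c,
 * or 0 if c is a continuation byte (10xxxxxx) and cannot start one.
 * The 5- and 6-octet forms come from the original UTF-8 definition
 * (RFC 2279); valid input never begins a sequence with 0xFE or 0xFF. */
static int utf8_seqlen(unsigned char c)
{
    if (c < 0x80) return 1;     /* 0xxxxxxx: plain US-ASCII             */
    if (c < 0xC0) return 0;     /* 10xxxxxx: continuation octet         */
    if (c < 0xE0) return 2;     /* 110xxxxx: Latin-1 and most of Europe */
    if (c < 0xF0) return 3;     /* 1110xxxx                             */
    if (c < 0xF8) return 4;     /* 11110xxx                             */
    if (c < 0xFC) return 5;     /* 111110xx                             */
    return 6;                   /* 1111110x                             */
}

So a MAXTOKENLEN counted in octets only gets stretched by scripts outside
the Latin range, which fits the parochial view above.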

>OTOH, I see no reference to "wchar" in flex.info, so I'm unsure whether
>we can feed flex anything other than chars; we might then have to
>replace flex.
>
>If I were doing a MIME parser in flex, I'd have two alternatives:
>
>1. eat up the lines that contain the boundary, charset, etc. (so the
>    charset name, for example, gets lost as a token), or
>
>2. use the REJECT rule to push the lines I extracted information from
>    back to the input and have the other lexer rules deal with them
>    again.  It is documented as slowing the whole lexer down, but I
>    don't know by how much.

Solution 1 is pretty workable.  In the current get_token() there's an
if(IPADDR) before the while{} loop and a number of if(class==TOKEN_TYPE)
statements inside the loop.  I've had versions of the function that use
switch(){} statements instead of the conditionals.  Using a variable,
e.g. save_class, to remember the token class gives a simple state machine
that _we_ totally control.  For example, I used this mechanism at one
point to let flex recognize complete "From " lines, i.e.
"^From\ .*{DATE}$", and then to reparse the line for the individual
tokens.
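
Roughly, the shape I have in mind looks like the sketch below.  All the
names (token_class_t, TOK_FROM_LINE, from_save_line, from_next_word) are
made up for illustration; this isn't the actual bogofilter code:

extern int   yylex(void);
extern char *yytext;

typedef enum { TOK_NONE = 0, TOK_WORD, TOK_IPADDR, TOK_FROM_LINE }
    token_class_t;

/* hypothetical helpers that split a saved "From ..." line into words */
extern void  from_save_line(const char *line);
extern char *from_next_word(void);

static token_class_t save_class = TOK_NONE;

token_class_t get_token(void)
{
    /* If the previous match was a whole "^From\ .*{DATE}$" line, keep
     * handing out its individual words until the line is used up. */
    if (save_class == TOK_FROM_LINE) {
        if (from_next_word() != NULL)
            return TOK_WORD;
        save_class = TOK_NONE;      /* line exhausted; read more input */
    }

    save_class = (token_class_t)yylex();

    if (save_class == TOK_FROM_LINE) {
        from_save_line(yytext);     /* remember the line for reparsing */
        return get_token();         /* first word comes out of it      */
    }
    return save_class;
}

The point is just that the state lives in save_class, where we can see it
and totally control it.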

Of course, I don't yet know about the REJECT mechanism, so there might be a 
simpler, better way to skin that cat.  There probably is :-)




