lexer, tokens, and content-types

Matthias Andree matthias.andree at gmx.de
Mon Dec 9 13:48:52 CET 2002


On Mon, 09 Dec 2002, David Relson wrote:

> Like you say, qp is trivial.  Base64 and uuencode are easy.  MIME is the 
> difficult one.  You've mentioned libgmime.  As I haven't researched it, I 
> ask the simple questions:  Is it complete? robust? easy to use?

I haven't looked at libgmime myself, but Pan (which uses libgmime) has
had character-set problems; I'm not sure whether those affect the
decoding code, but the library looks rather large.
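For the record, qp decoding really is trivial. A minimal sketch, with a
hypothetical qp_decode() helper that is not bogofilter's actual code
(assumes CRLF has already been canonicalized to \n):

    #include <ctype.h>
    #include <stddef.h>

    static int hexval(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        return tolower(c) - 'a' + 10;
    }

    /* Decode RFC 2045 quoted-printable in place: turn =XX escapes
     * into octets and drop soft line breaks ("=" at end of line).
     * Returns the decoded length. */
    static size_t qp_decode(char *buf, size_t len)
    {
        size_t in, out = 0;

        for (in = 0; in < len; in++) {
            if (buf[in] != '=') {
                buf[out++] = buf[in];
            } else if (in + 1 < len && buf[in + 1] == '\n') {
                in += 1;                        /* soft line break */
            } else if (in + 2 < len &&
                       isxdigit((unsigned char)buf[in + 1]) &&
                       isxdigit((unsigned char)buf[in + 2])) {
                buf[out++] = (char)(hexval(buf[in + 1]) * 16 +
                                    hexval(buf[in + 2]));
                in += 2;
            } else {
                buf[out++] = buf[in];           /* pass a lone '=' */
            }
        }
        return out;
    }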

> OK, UTF-8 is transparent for US-ASCII.  How compact is it for characters in 
> European languages?

Usually two octets per character: the accented Latin characters that
European languages need lie in U+0080..U+07FF, and UTF-8 encodes that
whole range in exactly two octets.
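A sketch of the two-octet case (utf8_encode_2 is a hypothetical name,
nothing that exists in bogofilter):

    /* Encode a code point in U+0080..U+07FF as UTF-8. That range
     * covers the accented Latin characters of most European
     * languages, and always takes exactly two octets. */
    static int utf8_encode_2(unsigned int cp, unsigned char out[2])
    {
        if (cp < 0x80 || cp >= 0x800)
            return -1;                  /* not a two-octet code point */
        out[0] = 0xC0 | (cp >> 6);      /* 110xxxxx: high 5 bits */
        out[1] = 0x80 | (cp & 0x3F);    /* 10xxxxxx: low 6 bits  */
        return 2;
    }
    /* e.g. U+00E9 'é' comes out as 0xC3 0xA9 */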

> If it does that well, I think it'd work for 
> bogofilter.  I may be taking a parochial view here, but it doesn't bother 
> me if Asian languages have characters that convert to 6 octets and have 
> tokens that run afoul of MAXTOKENLEN.

We need to separate "printed characters" from "stored characters"
(octets) then.
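In UTF-8 that separation is mechanical: continuation octets always look
like 10xxxxxx, so counting printed characters just means counting the
octets that are not continuation bytes. A sketch (utf8_strlen is a
hypothetical helper, and it assumes well-formed UTF-8):

    #include <stddef.h>

    /* Count "printed" characters (code points) in a buffer of
     * "stored" characters (octets). Continuation octets match
     * (c & 0xC0) == 0x80 and start no new code point. */
    static size_t utf8_strlen(const unsigned char *s, size_t octets)
    {
        size_t i, chars = 0;

        for (i = 0; i < octets; i++)
            if ((s[i] & 0xC0) != 0x80)
                chars++;
        return chars;
    }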

> Solution 1 is pretty workable.  In the current get_token() there's an 
> if(IPADDR) before the while{} loop and a number of if(class==TOKEN_TYPE) 
> statements inside the loop.  I've had versions of the function that use 
> switch(){}'s instead of the conditionals.  By using a variable, e.g. 
> save_class, to remember the token class, this gives a simple state machine 
> that _we_ totally control.  For example, I used this mechanism at one point 
> to let flex recognize complete From lines, i.e. "^From\ .*{DATE}$" and 
> then to reparse the line for the individual tokens.
> 
> Of course, I don't yet know about the REJECT mechanism, so there might be a 
> simpler, better way to skin that cat.  There probably is :-)

The question is whether it matters performance-wise. We don't want to discard
possible strong indicators for Graham mode, do we? :-)
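For reference, here is a sketch of the save_class state machine you
describe; every name in it (token_class_t, next_flex_match,
reparse_from_line) is a hypothetical stand-in, not bogofilter's actual
interface:

    /* Wrapper around the flex scanner that remembers the class of
     * the last match, so _we_ control the state transitions. */
    typedef enum { CLASS_NONE, CLASS_TOKEN, CLASS_IPADDR, CLASS_FROM }
        token_class_t;

    token_class_t next_flex_match(void);  /* wraps yylex()            */
    void reparse_from_line(void);         /* pushes the matched line
                                             back onto the flex input */

    token_class_t get_token(void)
    {
        static token_class_t save_class = CLASS_NONE;
        token_class_t class;

        while ((class = next_flex_match()) != CLASS_NONE) {
            switch (class) {
            case CLASS_FROM:
                /* flex matched a whole "^From .*{DATE}$" line:
                 * remember that and rescan it for its tokens. */
                save_class = CLASS_FROM;
                reparse_from_line();
                break;
            default:
                save_class = class;   /* ordinary token or IP address */
                return class;
            }
        }
        return CLASS_NONE;
    }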

-- 
Matthias Andree



