lexer, tokens, and content-types
Matthias Andree
matthias.andree at gmx.de
Mon Dec 9 13:48:52 CET 2002
On Mon, 09 Dec 2002, David Relson wrote:
> Like you say, qp is trivial. Base64 and uuencode are easy. MIME is the
> difficult one. You've mentioned libgmime. As I haven't researched it, I
> ask the simple questions: Is it complete? Robust? Easy to use?
I haven't looked at libgmime, but there have been character-set issues
with Pan (which uses libgmime); I'm not sure whether these affect the
decoding side, but the library looks rather large.
> OK, UTF-8 is a transparent superset of US-ASCII. How compact is it for
> characters in European languages?
Usually two octets per character.
> If it does that well, I think it'd work for
> bogofilter. I may be taking a parochial view here, but it doesn't bother
> me if Asian languages have characters that convert to 6 octets and have
> tokens that run afoul of MAXTOKENLEN.
We need to separate "printed characters" from "stored characters"
(octets) then.
> Solution 1 is pretty workable. In the current get_token() there's an
> if(IPADDR) before the while{} loop and a number of if(class==TOKEN_TYPE)
> statements inside the loop. I've had versions of the function that use
> switch(){}'s instead of the conditionals. By using a variable, e.g.
> save_class, to remember the token class, this gives a simple state machine
> that _we_ totally control. For example, I used this mechanism at one point
> to let flex recognize complete From lines, i.e. "^From\ .*{DATE}$" and
> then to reparse the line for the individual tokens.
>
> Of course, I don't yet know about the REJECT mechanism, so there might be a
> simpler, better way to skin that cat. There probably is :-)
The question is whether it matters performance-wise. We don't want to
discard possible strong indicators for Graham mode, do we? :-)
--
Matthias Andree