lexer, tokens, and content-types
matthias.andree at gmx.de
Mon Dec 9 06:32:36 EST 2002
On Sun, 08 Dec 2002, David Relson wrote:
> Let's wait to hear Gyepi's opinion. He has used eps in other projects and
> knows more about it than we do.
> I don't expect to find a perfect solution, i.e. one that can just be
> dropped in and will do everything we want. If we can find a good (partial)
> solution, that'd be better than doing it all, by ourselves. I figure we
> have other things to do besides reinvent the wheel, don't we?
Well, I easily lose trust in a foreign project, and I'll have to wonder
if we want to audit that eps code or rather coin our own. I'd happily
use a library, but I'd expect of it that it:
a) knows all MIME encodings
b) doesn't invent its own encodings that are not registered by the IANA
c) decodes quoted-printable and base64
d) does not try to decode fraudulent data, such as the I-love-you
signature kind of thing. People commonly write bogus "begin" lines,
as in "begin quoting Joe Sixpack who wrote", and add an "end" line to
defeat Outlook's idiotic quoting. We can neither ignore these lines
nor uudecode them. Outlook has had its issues, we don't need to copy
its bugs. Ignoring the 60-byte uuencode lines is fine though.
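
To make points c) and d) concrete, here is a rough Python sketch (not
bogofilter code, and not the eps API). It decodes by
Content-Transfer-Encoding using the standard library and accepts a
"begin" line as uuencode only when it has an octal mode and is followed
by a line consistent with its uuencode length byte. The names
decode_body, looks_like_uu_line and is_real_uu_begin are made up for
illustration, and the checks are an assumption about what "strict
enough" might mean.

```python
import base64
import quopri
import re

# Strict pattern for a real uuencode header: "begin <octal mode> <filename>".
UU_BEGIN = re.compile(rb"^begin [0-7]{3,4} \S.*$")


def decode_body(body, cte):
    """Decode a MIME body according to its Content-Transfer-Encoding."""
    cte = cte.strip().lower()
    if cte == "base64":
        return base64.b64decode(body)
    if cte == "quoted-printable":
        return quopri.decodestring(body)
    # 7bit, 8bit, binary and unknown encodings pass through unchanged.
    return body


def looks_like_uu_line(line):
    """Check that a line is consistent with its uuencode length byte."""
    if not line:
        return False
    n = (line[0] - 32) & 0x3F          # number of decoded bytes claimed
    expected = 1 + 4 * ((n + 2) // 3)  # length byte plus encoded groups
    in_range = all(32 <= c <= 96 for c in line)
    return 0 < n <= 45 and len(line) >= expected and in_range


def is_real_uu_begin(lines, i):
    """True only if line i is a strict begin header followed by uu data."""
    if not UU_BEGIN.match(lines[i]):
        return False
    return i + 1 < len(lines) and looks_like_uu_line(lines[i + 1])
```

With this, "begin quoting Joe Sixpack who wrote" is kept as ordinary
text, because it fails the header pattern, while a genuine
"begin 644 file" block is recognised and can be skipped.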
> The task at hand is the third step, which can be further broken down:
a0. detect attachments
> a. decode attachments
> b. normalize character sets (unicode?)
> c. parse via lexer.l
> d. tokenize - header prefixes, subnet tokens, etc
> Given good info from steps a & b, I'm thinking the current parsing (step c)
> is ok. eps is a possible (partial) solution for (a) and iconv a possible
> solution for (b). Don't forget my charset code as a beginning for (b),
> though as I said (above), if somebody else's solution will solve our
> problem, we should consider using their solution. Step d is mostly new
> territory, with needs I don't yet know, though I've heard the spambayes
> project has interesting ideas for that.
A useful break-down.
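
For illustration only, the steps could be strung together roughly like
this in Python, using the standard email package in place of eps and
iconv. The function name tokens_from_message, the "subj:" prefix and
the simple \w+ token pattern are assumptions standing in for lexer.l
and the real tokenizer, not a description of either.

```python
import email
import re
from email import policy

# Stand-in for the lexer: runs of word characters.
TOKEN = re.compile(r"\w+")


def tokens_from_message(raw):
    """Return a list of lower-cased tokens from a raw RFC 822 message."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    out = []

    # Step d (in part): tag header tokens with a prefix.
    subject = str(msg.get("Subject", ""))
    out += ["subj:" + t.lower() for t in TOKEN.findall(subject)]

    for part in msg.walk():
        if part.is_multipart():
            continue
        # Step a0: only text parts are tokenized, other attachments are skipped.
        if part.get_content_maintype() != "text":
            continue
        # Step a: undo base64 / quoted-printable.
        payload = part.get_payload(decode=True) or b""
        # Step b: normalize the character set to Unicode.
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a lossless byte mapping.
            text = payload.decode("latin-1")
        # Step c: split into tokens.
        out += [t.lower() for t in TOKEN.findall(text)]
    return out
```

The point of the ordering is that the tokenizer only ever sees decoded
Unicode text, so it needs no knowledge of transfer encodings or
charsets.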