mime processing and multiple lexers
David Relson
relson at osagesoftware.com
Thu Jan 2 01:20:15 CET 2003
Greetings,
I've released code for multiple lexers and nested mime parts. It's in the
mime branch of cvs on sourceforge.
As has been suggested on the list, there are three lexers.
lexer_header.l understands message headers and mime boundaries.
lexer_text_plain.l understands plain text (as in "Content-Type: text/plain")
lexer_text_html.l understands html (as in "Content-Type: text/html")
The two text lexers are pretty simple and know little outside their
specific realm of action. They do know about "^From ", mime boundaries, IP
addresses, and a few other patterns.
The text_html lexer cooperates with get_token() so that virtually nothing
within an html tag gets back to bogofilter. IIRC, numeric IP addresses
are the one exception.
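To illustrate the idea of dropping tag contents, here is a minimal standalone sketch in C. It is not the code in the mime branch: html_tokens(), looks_like_ip() and TOKEN_MAX are hypothetical names invented for this example, and the real work is split between the flex rules and get_token(). The sketch assumes a tag is everything between '<' and the next '>', so a '>' inside a quoted attribute or an html comment would confuse it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 32   /* hypothetical limit, chosen for the example */

/* Returns 1 if tok is four dot-separated groups of one to three digits. */
static int looks_like_ip(const char *tok)
{
    int groups = 0;
    while (*tok) {
        int digits = 0;
        while (isdigit((unsigned char)*tok)) { tok++; digits++; }
        if (digits < 1 || digits > 3)
            return 0;
        groups++;
        if (*tok == '.')
            tok++;
        else
            break;
    }
    return groups == 4 && *tok == '\0';
}

/* Copies up to max tokens from text into out and returns how many were kept.
 * Outside a tag every token is kept; inside a tag only numeric IP addresses. */
size_t html_tokens(const char *text, char out[][TOKEN_MAX], size_t max)
{
    size_t count = 0;
    int in_tag = 0;
    const char *p = text;

    assert(text != NULL && out != NULL);
    while (*p && count < max) {
        if (*p == '<') { in_tag = 1; p++; continue; }
        if (*p == '>') { in_tag = 0; p++; continue; }
        if (!isalnum((unsigned char)*p)) { p++; continue; }

        /* Collect a run of letters, digits and dots. */
        const char *start = p;
        while (isalnum((unsigned char)*p) || *p == '.')
            p++;
        size_t len = (size_t)(p - start);
        while (len > 0 && start[len - 1] == '.')
            len--;                    /* drop sentence-ending dots */
        if (len >= TOKEN_MAX)
            continue;                 /* skip overlong tokens */

        memcpy(out[count], start, len);
        out[count][len] = '\0';
        if (!in_tag || looks_like_ip(out[count]))
            count++;                  /* keep it; otherwise the slot is reused */
    }
    return count;
}
```

Fed the text `Click <a href="http://10.1.2.3/buy">here</a> now.`, the sketch keeps Click, 10.1.2.3, here and now, and drops a, href, http and buy.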
For nested mime parts, I changed the handling of boundary tags so that
encountering a new (different) boundary tag starts a new mime level. A
repeated boundary tag is recognized as another of multiple mime parts and
processing continues at the current level (after some initialization). An
ending tag pops the current level.
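The three cases above can be sketched as a small stack of boundary strings. This is only an illustration under stated assumptions, not the code in the mime branch: mime_stack, mime_boundary(), the enum names and MAX_DEPTH are hypothetical, and the caller is assumed to have already recognized the line as a boundary line and stripped the leading "--".

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH    8    /* hypothetical nesting limit for the example */
#define MAX_BOUNDARY 71   /* RFC 2046 allows boundaries of up to 70 characters */

typedef enum { LEVEL_NEW, PART_NEXT, LEVEL_END, BOUNDARY_IGNORED } boundary_kind;

typedef struct {
    char boundary[MAX_DEPTH][MAX_BOUNDARY];
    int depth;            /* number of open mime levels */
} mime_stack;

/* tag is the text of a boundary line after its leading "--". */
boundary_kind mime_boundary(mime_stack *s, const char *tag)
{
    size_t len = strlen(tag);

    assert(s->depth >= 0 && s->depth <= MAX_DEPTH);
    if (s->depth > 0) {
        const char *top = s->boundary[s->depth - 1];
        size_t tlen = strlen(top);

        /* Repeated tag: another part at the current level. */
        if (strcmp(tag, top) == 0)
            return PART_NEXT;

        /* Current tag followed by "--": ending tag, pop the level. */
        if (len == tlen + 2 && strncmp(tag, top, tlen) == 0
            && strcmp(tag + tlen, "--") == 0) {
            s->depth--;
            return LEVEL_END;
        }
    }

    /* Anything else is a new (different) tag: push a new level. */
    if (s->depth == MAX_DEPTH || len >= MAX_BOUNDARY)
        return BOUNDARY_IGNORED;
    memcpy(s->boundary[s->depth], tag, len + 1);
    s->depth++;
    return LEVEL_NEW;
}
```

One simplification to be aware of: only the innermost boundary is compared, so an ending tag for an outer level that arrives while an inner level is still open would be pushed as a new level rather than unwinding the stack.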
It's all working, as best I can tell.
Now to update the reference outputs for regression testing. All the
spamicity computing tests have changed results because of changes in the
lexer (case insensitivity, handling of mime directives, distinction between
plain text and html, etc.).
David