mime processing and multiple lexers

David Relson relson at osagesoftware.com
Thu Jan 2 01:20:15 CET 2003


Greetings,

I've released code for multiple lexers and nested MIME parts.  It's in the 
mime branch of CVS on SourceForge.

As has been suggested on the list, there are three lexers.

	lexer_header.l understands message headers and MIME boundaries.
	lexer_text_plain.l understands plain text (as in "Content-Type: text/plain")
	lexer_text_html.l understands HTML (as in "Content-Type: text/html")
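
To illustrate the division of labor, here is a minimal sketch in C of how a 
body lexer might be chosen from a Content-Type value.  The names lexer_kind, 
has_prefix_nocase() and select_body_lexer() are hypothetical and are not 
taken from the mime branch; the sketch only does a case-insensitive prefix 
match and falls back to plain text for anything else.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical identifiers, one per lexer file named above. */
typedef enum {
    LEXER_HEADER,       /* lexer_header.l     */
    LEXER_TEXT_PLAIN,   /* lexer_text_plain.l */
    LEXER_TEXT_HTML     /* lexer_text_html.l  */
} lexer_kind;

/* Return 1 if s begins with prefix, ignoring ASCII case. */
static int has_prefix_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;   /* also covers s ending before prefix does */
        s++;
        prefix++;
    }
    return 1;
}

/* Choose the body lexer from the value part of a Content-Type header.
 * Assumption: anything that is not text/html is treated as plain text. */
static lexer_kind select_body_lexer(const char *content_type)
{
    assert(content_type != NULL);

    /* Skip leading whitespace after the "Content-Type:" field name. */
    while (*content_type == ' ' || *content_type == '\t')
        content_type++;

    if (has_prefix_nocase(content_type, "text/html"))
        return LEXER_TEXT_HTML;
    return LEXER_TEXT_PLAIN;
}
```

The header lexer would stay active until the blank line that ends the 
headers, at which point something like select_body_lexer() decides who 
reads the body.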

The two text lexers are pretty simple and know little outside their 
specific realm of action.  They do know about "^From ", MIME boundaries, IP 
addresses, and a few other patterns.

The text_html lexer cooperates with get_token() so that virtually nothing 
within an HTML tag gets back to bogofilter.  IIRC, numeric IP addresses 
are the one exception.
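
The filtering idea can be sketched as below.  This is not the actual 
get_token() code; token_class, html_filter and html_filter_keep() are 
hypothetical names, and the sketch assumes the lexer reports tag open and 
tag close as separate token classes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token classes reported by the HTML lexer. */
typedef enum { T_WORD, T_IPADDR, T_TAG_OPEN, T_TAG_CLOSE } token_class;

/* Filter state: nonzero while we are between '<' and '>'. */
typedef struct {
    int in_tag;
} html_filter;

/* Return 1 if the token should be passed on to bogofilter, 0 if it
 * should be dropped.  Tokens inside a tag are dropped, except numeric
 * IP addresses, which always get through. */
static int html_filter_keep(html_filter *f, token_class cls)
{
    assert(f != NULL);

    switch (cls) {
    case T_TAG_OPEN:
        f->in_tag = 1;
        return 0;
    case T_TAG_CLOSE:
        f->in_tag = 0;
        return 0;
    case T_IPADDR:
        return 1;               /* the one exception */
    case T_WORD:
    default:
        return !f->in_tag;      /* ordinary tokens only outside tags */
    }
}
```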

For nested MIME parts, I changed the handling of boundary tags so that 
encountering a new (different) boundary tag starts a new MIME level.  A 
repeated boundary tag is recognized as the start of another part at the 
same level, and processing continues at the current level (after some 
initialization).  An ending tag pops the current level.
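
The three rules above amount to a small stack of boundary strings.  Here is 
a minimal sketch in C, under stated assumptions: mime_stack, mime_init() 
and mime_boundary() are hypothetical names rather than the code in the 
branch, the nesting limit of 8 is arbitrary, and every line starting with 
"--" is treated as a boundary candidate, which real code would need to be 
more careful about.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH     8    /* assumed nesting limit, arbitrary */
#define MAX_BOUNDARY 70    /* RFC 2046 caps boundaries at 70 characters */

typedef enum { B_NONE, B_NEW_LEVEL, B_NEXT_PART, B_END_LEVEL } boundary_event;

typedef struct {
    char boundary[MAX_DEPTH][MAX_BOUNDARY + 1];
    int  depth;            /* number of active MIME levels */
} mime_stack;

static void mime_init(mime_stack *st)
{
    st->depth = 0;
}

/* Classify one input line and update the stack accordingly. */
static boundary_event mime_boundary(mime_stack *st, const char *line)
{
    const char *tag;
    size_t len;

    assert(st->depth >= 0 && st->depth <= MAX_DEPTH);

    if (strncmp(line, "--", 2) != 0)
        return B_NONE;
    tag = line + 2;
    len = strcspn(tag, "\r\n");          /* ignore the line ending */
    if (len == 0)
        return B_NONE;

    if (st->depth > 0) {
        const char *top = st->boundary[st->depth - 1];
        size_t toplen = strlen(top);

        /* Repeated tag: another part at the current level. */
        if (len == toplen && memcmp(tag, top, len) == 0)
            return B_NEXT_PART;

        /* Current tag followed by "--": ending tag, pop the level. */
        if (len == toplen + 2 && memcmp(tag, top, toplen) == 0
            && memcmp(tag + toplen, "--", 2) == 0) {
            st->depth--;
            return B_END_LEVEL;
        }
    }

    /* New (different) tag: push a new level if it fits. */
    if (len > MAX_BOUNDARY || st->depth == MAX_DEPTH)
        return B_NONE;
    memcpy(st->boundary[st->depth], tag, len);
    st->boundary[st->depth][len] = '\0';
    st->depth++;
    return B_NEW_LEVEL;
}
```

On B_NEXT_PART the caller would redo the per-part initialization (back to 
the header lexer for the part's own headers) without touching the depth.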

It's all working, as best I can tell.

Now to update the reference outputs for regression testing.  All the 
spamicity computation tests have changed results because of changes in the 
lexer (case insensitivity, handling of MIME directives, distinction between 
plain text and HTML, etc.).

David




