effects of lexer changes

Gyepi SAM gyepi at praxis-sw.com
Sat Jan 4 03:19:59 CET 2003


On Sat, Jan 04, 2003 at 01:48:03AM +0100, Matthias Andree wrote:
> Gyepi SAM <gyepi at praxis-sw.com> writes:
> >> Is the nesting depth the correct parameter? I'd think that we ought to
> >> look at the mime type instead. If it's message/rfc822, text/*, we want
> >> to have a look, as these can happen at top level.
> >
> > I hope you meant message/rfc822 and multipart/*.
> > text/* does not nest. Minor issue, but at this stage it is important to
> > agree on these things.
> 
> Well, we can have text/* at any level, including top level. We certainly
> also want to have a look at tokens in the message/rfc822 parts. We might
> not want to look into the tokens of multipart/*, but only of its
> subparts. But then again, we might want to tokenize these as well as
> their subparts, in case there is an indicative MIME prologue (the part
> between the end of the multipart/* headers and the first boundary line,
> that is usually something like "This is a MIME-encapsulated message.").

My plan is to tokenize everything that can be tokenized, including the
prologue, which RFC 2046 calls the preamble, and the "postamble" (the
epilogue in RFC 2046), and to skip only unrecognized parts. So I think we
should look at the tokens of multipart/* and all the recognized subparts.
For a multipart/alternative message with text and html parts, this
essentially doubles the token count, but I don't think that should be
cause for concern.
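
To make the proposal concrete, here is a rough sketch of the traversal in
Python, using the standard email package rather than our C lexer. The names
tokenize(), collect_tokens() and SAMPLE are made up for illustration, the
\w+ token pattern is only a stand-in for the real lexer rules, and header
tokens are left out to keep it short. It is not the parser I am working on.

```python
import re
from email import message_from_string

# Stand-in token pattern; the real lexer rules are more involved.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens (illustrative only)."""
    return TOKEN_RE.findall(text.lower())


def collect_tokens(raw):
    """Tokenize preamble, epilogue and every text/* body in a message."""
    tokens = []
    for part in message_from_string(raw).walk():
        if part.is_multipart():
            # multipart/* containers and message/rfc822 wrappers: only the
            # preamble and epilogue belong to the container itself; walk()
            # visits the subparts on its own.
            for extra in (part.preamble, part.epilogue):
                if extra:
                    tokens.extend(tokenize(extra))
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "ascii"
            try:
                text = payload.decode(charset, errors="replace")
            except LookupError:
                # Unknown charset label: fall back to ASCII with replacement.
                text = payload.decode("ascii", errors="replace")
            tokens.extend(tokenize(text))
        # Any other leaf type is unrecognized here and skipped.
    return tokens


# Hypothetical multipart/alternative message with a preamble and epilogue.
SAMPLE = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/alternative; boundary="XX"',
    "",
    "This is a MIME-encapsulated message.",
    "--XX",
    "Content-Type: text/plain",
    "",
    "cheap pills",
    "--XX",
    "Content-Type: text/html",
    "",
    "<b>cheap pills</b>",
    "--XX--",
    "trailer",
    "",
])

if __name__ == "__main__":
    print(collect_tokens(SAMPLE))
```

With the sample message, the words of the text part show up a second time
from the html part, which is the doubling mentioned above.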

FYI, I had been working with an early version of the mime parser, but stopped
when a flurry of changes caused too many conflicts. I will continue working
on it and will offer it up RSN.

-Gyepi



