effects of lexer changes
gyepi at praxis-sw.com
Fri Jan 3 21:19:59 EST 2003
On Sat, Jan 04, 2003 at 01:48:03AM +0100, Matthias Andree wrote:
> Gyepi SAM <gyepi at praxis-sw.com> writes:
> >> Is the nesting depth the correct parameter? I'd think that we ought to
> >> look at the mime type instead. If it's message/rfc822, text/*, we want
> >> to have a look, as these can happen at top level.
> > I hope you meant message/rfc822 and multipart/*.
> > text/* does not nest. Minor issue, but at this stage it is important to
> > agree on these things.
> Well, we can have text/* at any level, including top level. We certainly
> also want to have a look at tokens in the message/rfc822 parts. We might
> not want to look into the tokens of multipart/*, but only of its
> subparts. But then again, we might want to tokenize these as well as
> their subparts, in case there is an indicative MIME prologue (the part
> between the end of the multipart/* headers and the first boundary line,
> that is usually something like "This is a MIME-encapsulated message.").
My plan is to tokenize everything that can be tokenized, including the
prologue (which RFC 2046 calls the preamble) and the "postamble" (the
epilogue), and to skip only unrecognized parts. So I think we should look at
the tokens of
multipart/* and all the recognized subparts. For a multipart/alternative
message with text and html parts, this essentially doubles the token count,
but I don't think that should be cause for concern.
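
To make the plan concrete, here is a rough sketch of the traversal. It is not
bogofilter code: it uses Python's standard email package instead of our lexer,
and the names TOKEN_RE, tokenize and collect_tokens, as well as the token
pattern itself, are made up for illustration. It tokenizes the preamble and
epilogue of every container part and the body of every text/* leaf, and skips
the other leaf types.

```python
import email
import re

# Hypothetical token pattern, for illustration only (not bogofilter's lexer).
TOKEN_RE = re.compile(r"[A-Za-z0-9$'_-]+")


def tokenize(text):
    """Split text into tokens; None or empty input yields no tokens."""
    return TOKEN_RE.findall(text or "")


def collect_tokens(msg):
    """Walk every MIME part and tokenize whatever can be tokenized."""
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            # Containers (multipart/*, message/rfc822): the body text lives
            # in the subparts, which walk() visits later, but the preamble
            # and epilogue belong to the container itself.
            tokens += tokenize(part.preamble)
            tokens += tokenize(part.epilogue)
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "ascii"
            try:
                text = payload.decode(charset, errors="replace")
            except LookupError:
                # Unknown charset label: fall back to ASCII.
                text = payload.decode("ascii", errors="replace")
            tokens += tokenize(text)
        # Any other leaf type (image/*, application/*, ...) is skipped.
    return tokens


raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "This is a MIME-encapsulated message.\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "cheap pills\n"
    "--b\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>cheap pills</p>\n"
    "--b--\n"
    "trailer\n"
)

tokens = collect_tokens(email.message_from_string(raw))
# "cheap" is counted twice: once from text/plain, once from text/html.
print(tokens.count("cheap"))
```

The sample message shows the doubling: each word of the body is counted once
for the text part and once for the html part, and the html markup adds a few
tokens of its own.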
FYI, I had been working with an early version of the mime parser, but stopped
when a flurry of changes caused too many conflicts. I will continue working
on it and will offer it up RSN.