effects of lexer changes
Matthias Andree
matthias.andree at gmx.de
Sat Jan 4 13:17:45 CET 2003
On Fri, 03 Jan 2003, Gyepi SAM wrote:
> My plan is to tokenize everything that can be tokenized, including the
> prologue, which rfc2045 calls the preamble, and the "postamble",
> and only skip unrecognized parts. So I think we should look at the tokens of
> multipart/* and all the recognized subparts. This has the result, for a
> multipart/alternative message with text and html parts, of essentially
> doubling the token count but I don't think that should be cause for concern.
We should not count the same word twice. The text and html parts of a
multipart/alternative message carry the same content, so counting both
distorts the counts.
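
To illustrate the concern, here is a minimal sketch in Python (not
bogofilter's C lexer) of one possible reading: a token already seen in an
earlier body part is not counted again in a later one, while repeats within
a single part are kept. The names count_tokens, part_tokens and SAMPLE, the
crude tag stripping and the token pattern are all assumptions made for the
example, not anything taken from the bogofilter source.

```python
import re
from collections import Counter
from email import message_from_string

# Hypothetical, simplified token pattern; a real lexer is far more elaborate.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def part_tokens(text):
    """Return lower-cased word tokens of one body part, tags crudely removed."""
    text = re.sub(r"<[^>]*>", " ", text)
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def count_tokens(raw, dedupe=True):
    """Count body tokens of a message.

    With dedupe=True, a token that appeared in an earlier part is skipped in
    later parts, so text and html alternatives do not double the counts.
    """
    msg = message_from_string(raw)
    counts = Counter()
    seen_in_earlier_parts = set()
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body text of their own
        tokens = part_tokens(part.get_payload())
        for tok in tokens:
            if dedupe and tok in seen_in_earlier_parts:
                continue
            counts[tok] += 1
        seen_in_earlier_parts.update(tokens)
    return counts


# Made-up sample message for the example.
SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "preamble text\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "cheap pills cheap\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>cheap <b>pills</b> cheap now</p>\n"
    "--XX--\n"
)

if __name__ == "__main__":
    print("deduplicated:", dict(count_tokens(SAMPLE)))
    print("naive:       ", dict(count_tokens(SAMPLE, dedupe=False)))
```

With the sample above, the naive count sees "cheap" four times and the
deduplicated count sees it twice, while "now", which only occurs in the
html part, is still counted once.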
> FYI, I had been working with an early version of the mime parser, but stopped
> when a flurry of changes caused too many conflicts. I will continue working
> on it and will offer it up RSN.
Thanks in advance.
--
Matthias Andree