effects of lexer changes
David Relson
relson at osagesoftware.com
Sat Jan 4 03:47:54 CET 2003
At 09:19 PM 1/3/03, Gyepi SAM wrote:
>My plan is to tokenize everything that can be tokenized, including the
>prologue, which rfc2045 calls the preamble, and the "postamble",
>and only skip unrecognized parts. So I think we should look at the tokens of
>multipart/* and all the recognized subparts. This has the result, for a
>multipart/alternative message with text and html parts, of essentially
>doubling the token count but I don't think that should be cause for concern.
I think you're right about the doubling not mattering. I'd bet that lots
of messages say the same thing in text/plain and text/html - the difference
being the html formatting. Given the same content, collect_words() will
see 2 copies of each token. Since it removes duplicates, we should be fine.
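
To make the doubling concrete, here is a rough sketch in Python (not
bogofilter code). It walks a small multipart/alternative message, tokenizes
the preamble, both text parts, and the trailing text, and then drops
duplicates. The sample message, the crude tokens() lexer, and the helper
names are all made up for illustration; the set() at the end only stands in
for the duplicate removal that collect_words() does.

```python
import re
from email import message_from_string

# Hypothetical sample message: same content as text/plain and text/html.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

This is the preamble.
--XYZ
Content-Type: text/plain

Buy cheap widgets today
--XYZ
Content-Type: text/html

<html><body><b>Buy</b> cheap widgets today</body></html>
--XYZ--
This is the epilogue.
"""

def tokens(text):
    # Crude stand-in for the real lexer: strip HTML tags, then
    # pull out runs of letters, digits, and apostrophes.
    text = re.sub(r"<[^>]*>", " ", text)
    return [t.lower() for t in re.findall(r"[A-Za-z0-9']+", text)]

def all_tokens(msg):
    # Tokenize everything recognizable: preamble, text subparts, epilogue.
    out = []
    if msg.is_multipart():
        out += tokens(msg.preamble or "")
        for part in msg.get_payload():
            out += all_tokens(part)
        out += tokens(msg.epilogue or "")
    elif msg.get_content_maintype() == "text":
        out += tokens(msg.get_payload())
    # Anything else (images, unknown types) is skipped.
    return out

msg = message_from_string(RAW)
raw = all_tokens(msg)   # every body token shows up twice
unique = set(raw)       # duplicates collapse to one entry each

print(len(raw), "tokens before removing duplicates")
print(len(unique), "tokens after removing duplicates")
```

The body words appear once per part in the raw list, but only once in the
final set, so the extra html copy adds nothing new as long as both parts
carry the same words.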
>FYI, I had been working with an early version of the mime parser, but stopped
>when a flurry of changes caused too many conflicts. I will continue working
>on it and will offer it up RSN.
'At's what I wanted to hear.
David