effects of lexer changes

David Relson relson at osagesoftware.com
Sat Jan 4 03:47:54 CET 2003


At 09:19 PM 1/3/03, Gyepi SAM wrote:

>My plan is to tokenize everything that can be tokenized, including the
>prologue, which rfc2045 calls the preamble, and the "postamble",
>and only skip unrecognized parts.  So I think we should look at the tokens of
>multipart/* and all the recognized subparts. This has the result, for a
>multipart/alternative message with text and html parts, of essentially
>doubling the token count, but I don't think that should be cause for concern.

I think you're right about the doubling not mattering.  I'd bet that lots 
of messages say the same thing in text/plain and text/html - the difference 
being the html formatting.  Given the same content, collect_words() will 
see 2 copies of each token.  Since it removes duplicates, we should be fine.
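
A rough sketch of that reasoning, in Python rather than bogofilter's C, and not the actual collect_words() code. The helpers tokenize(), strip_tags() and unique_tokens() are made-up names for illustration, and the tag stripping is deliberately crude:

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a stand-in for the real lexer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def strip_tags(html):
    """Crudely drop HTML tags so only the visible text remains."""
    return re.sub(r"<[^>]*>", " ", html)

def unique_tokens(parts):
    """Collect tokens from every part, keeping a single copy of each."""
    seen = set()
    for part in parts:
        seen.update(tokenize(part))
    return seen

# The same content twice, as in a multipart/alternative message.
plain = "Buy cheap pills now"
html = "<html><body><b>Buy</b> cheap pills now</body></html>"

# Raw token stream: every word shows up twice.
all_tokens = tokenize(plain) + tokenize(strip_tags(html))

# After duplicate removal the count is back to that of a single part.
unique = unique_tokens([plain, strip_tags(html)])
```

Here the raw stream holds eight tokens, while the deduplicated set holds four, the same as the text/plain part alone. Any tokens that come from the html markup itself would still be extra, so this only covers the shared content.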

>FYI, I had been working with an early version of the mime parser, but stopped
>when a flurry of changes caused too many conflicts. I will continue working
>on it and will offer it up RSN.

'At's what I wanted to hear.

David




