effects of lexer changes
David Relson
relson at osagesoftware.com
Sat Jan 4 14:26:32 CET 2003
At 07:17 AM 1/4/03, Matthias Andree wrote:
>On Fri, 03 Jan 2003, Gyepi SAM wrote:
>
> > My plan is to tokenize everything that can be tokenized, including the
> > prologue, which rfc2045 calls the preamble, and the "postamble", and
> > only skip unrecognized parts. So I think we should look at the tokens
> > of multipart/* and all the recognized subparts. This has the result,
> > for a multipart/alternative message with text and html parts, of
> > essentially doubling the token count, but I don't think that should be
> > cause for concern.
>
>We should not count the same word twice. It distorts the counts.
>
> > FYI, I had been working with an early version of the mime parser, but
> > stopped when a flurry of changes caused too many conflicts. I will
> > continue working on it and will offer it up RSN.
>
>Thanks in advance.
No problemo.
There's a max_repeats parameter which applies here. Graham uses the value
4, while Robinson and Robinson-Fisher use the value 1. The first step of
both classification and registration is to call collect_words(), which is
where get_token() and the lexer are invoked. collect_words() applies the
max_repeats value. Assuming a mime message has the same content as
text/plain and text/html, the lexer will see each word twice, get_token()
will return each word twice, and collect_words() will apply max_repeats
and remove the duplicates.
David
More information about the bogofilter-dev mailing list