effects of lexer changes

David Relson relson at osagesoftware.com
Sat Jan 4 14:26:32 CET 2003


At 07:17 AM 1/4/03, Matthias Andree wrote:

>On Fri, 03 Jan 2003, Gyepi SAM wrote:
>
> > My plan is to tokenize everything that can be tokenized, including the
> > prologue, which rfc2045 calls the preamble, and the "postamble",
> > and only skip unrecognized parts.  So I think we should look at the
> > tokens of multipart/* and all the recognized subparts.  This has the
> > result, for a multipart/alternative message with text and html parts,
> > of essentially doubling the token count, but I don't think that should
> > be cause for concern.
>
>We should not count the same word twice. It distorts the counts.
>
> > FYI, I had been working with an early version of the mime parser, but
> > stopped when a flurry of changes caused too many conflicts.  I will
> > continue working on it and will offer it up RSN.
>
>Thanks in advance.

No problemo.
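To make that concrete, here's a hypothetical multipart/alternative message
of the kind Gyepi describes (the boundary and body text are invented for
illustration), with the same content in the text/plain and text/html parts:

MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

This line is the preamble (the "prologue" above).
--b1
Content-Type: text/plain

buy now
--b1
Content-Type: text/html

<html><body>buy now</body></html>
--b1--
And this line is the "postamble".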

There's a max_repeats parameter which applies here.  Graham uses the value
4, while Robinson and Robinson-Fisher use the value 1.  Classification and
registration both begin by calling collect_words(), which is where
get_token() and the lexer are invoked, and it is collect_words() that
applies the max_repeats value.  If a MIME message has the same content as
text/plain and text/html, the lexer will see each word twice and
get_token() will return each word twice, but collect_words() caps each
word's count at max_repeats, so with a value of 1 the duplicates are
removed.
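
For illustration, here's a minimal sketch of that capping step in C.  The
real collect_words() and get_token() internals differ; add_token() and the
fixed token array are inventions standing in for the lexer, and only the
max_repeats behavior matches the description above.

/* Sketch of the capping step: a token's count is never allowed to
 * exceed max_repeats.  The fixed token array stands in for
 * get_token() and the lexer; a real implementation would use a
 * hash table rather than a linear scan. */

#include <stdio.h>
#include <string.h>

#define MAX_WORDS 64
#define MAX_LEN   32

struct wordcount {
    char word[MAX_LEN];
    int  count;
};

/* Record one occurrence of tok, capping its count at max_repeats. */
static void add_token(struct wordcount *tab, int *n,
                      const char *tok, int max_repeats)
{
    int i;

    for (i = 0; i < *n; i++) {
        if (strcmp(tab[i].word, tok) == 0) {
            if (tab[i].count < max_repeats)
                tab[i].count++;
            return;
        }
    }
    if (*n < MAX_WORDS) {
        strncpy(tab[*n].word, tok, MAX_LEN - 1);
        tab[*n].word[MAX_LEN - 1] = '\0';
        tab[*n].count = 1;
        (*n)++;
    }
}

int main(void)
{
    /* Same words seen twice, as from matching text/plain and
     * text/html parts of a multipart/alternative message. */
    const char *tokens[] = { "buy", "now", "buy", "now" };
    const int ntokens = (int)(sizeof tokens / sizeof tokens[0]);
    const int max_repeats = 1;      /* Robinson / Robinson-Fisher */
    struct wordcount tab[MAX_WORDS];
    int n = 0;
    int i;

    for (i = 0; i < ntokens; i++)
        add_token(tab, &n, tokens[i], max_repeats);

    for (i = 0; i < n; i++)
        printf("%s %d\n", tab[i].word, tab[i].count);
    return 0;
}

With max_repeats = 1 the program prints each word once with a count of 1;
with Graham's value of 4 the doubled words would keep a count of 2.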

David