effects of lexer changes

David Relson relson at osagesoftware.com
Tue Dec 31 16:07:01 CET 2002


Matthias,

I've been looking at the regression tests to understand why they're 
failing.  So far I've identified several changes in lexer.l that cause 
the output to differ: some of the differences are intentional and 
correct, while others are defects.  I'm testing with 0.9.1.2 and the 
current MIME-processing lexer.  Token lists were generated by running 
bogolexer on message tests/t.systest.d/inputs/msg.2.txt; the comments 
below all refer to that message.

1 - changing MAXTOKENLEN from 20 to 30 adds some tokens, for example 
the 22-character www.genuinerewards.com.  This is fine.

2 - BOUNDARY tokens are no longer returned by get_token().  Fine.

3 - lexer.l is caseless, so "All" matches pattern "all" and disappears from 
the output.  Fine.

These are all correct.

4 - tokens from MIME directives are no longer being output.  For example, 
"Content-Type: text/plain" used to return 3 tokens.  It looks like 
yyredo() isn't working.  I'll fix it.

5 - The new code to "ignore anything when not reading text MIME types" may 
be overzealous.  In msg.2.txt, it causes the text between the message 
header and the first boundary line to be ignored.  Since there _is_ text 
there, I think we want bogofilter to see it.  What do you think?

The code that ignores tokens shouldn't take effect until bogofilter is in 
the body of a MIME part.  I think there's a stack-level confusion, i.e. 
the MIME info is being set at one stack level but consulted at another.

Command "bogolexer -p -x lm -vvv < tests/t.systest.d/inputs/msg.2.txt" 
gives a good view of what's happening.

David

More information about the bogofilter-dev mailing list