effects of lexer changes

Tue Dec 31 15:53:26 CET 2002

Matthias,

I've been looking at the regression tests to understand why they're 
failing.  So far I've identified several changes in lexer.l that cause 
output to be different.  FWIW, I'm testing with 0.9.1.2 and the current 
mime processing lexer.  The comments below apply to bogolexer and message 
tests/t.systest.d/inputs/msg.2.txt.

1 - changing MAXTOKENLEN from 20 to 30 adds some tokens, for example 
www.genuinerewards.com.  This is fine.

2 - BOUNDARY tokens are no longer returned by get_token().  Fine.

3 - lexer.l is caseless, so "All" matches pattern "all" and disappears from 
the output.  Fine.

Changes 1 through 3 are all correct.
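
To illustrate items 1 and 3, here is a rough sketch. It is not the actual 
lexer.l code: it assumes MAXTOKENLEN acts as a simple length cutoff, and 
the helper names token_fits() and same_ignoring_case() are made up for 
this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: keep a token only if it is no longer than max_len.
 * Assumes the limit is a plain length cutoff. */
static bool token_fits(const char *tok, size_t max_len)
{
    assert(tok != NULL);
    return strlen(tok) <= max_len;
}

/* Hypothetical helper: compare two strings ignoring case, the way a
 * caseless lexer would treat "All" and "all" as the same word. */
static bool same_ignoring_case(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings end together */
}
```

www.genuinerewards.com is 22 characters long, so under these assumptions 
it is rejected with a limit of 20 and accepted with a limit of 30.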

4 - tokens from mime directives are not being output.  For example, 
"Content-Type: text/plain" used to return 3 tokens.  Looks like yyredo() 
isn't working.  I'll fix it.

5 - The new code to "ignore anything when not reading text MIME types" may 
be overzealous.  In msg.2.txt, it causes the text between the message 
header and the first boundary line to be ignored.  Since there _is_ text 
there, I think we want bogofilter to see it.  What do you think?

The code to ignore tokens shouldn't take effect until bogofilter is in the 
body of a mime part.  I think there's a problem of stack level confusion, 
i.e. the level at which mime info is set isn't the level at which it's used.
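
Here is a minimal sketch of the rule I have in mind.  None of these names 
come from the bogofilter sources: the structs and functions are 
hypothetical, and the sketch assumes depth 0 is the message itself and 
that every boundary line pushes a new level.  The point is that the type 
is set and tested at the same stack level, and tokens are only dropped 
inside the body of a non-text part.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPTH 8   /* arbitrary limit for this sketch */

enum section { IN_HEADER, IN_BODY };

struct mime_part {
    bool is_text;          /* true for text/... content types */
    enum section where;    /* reading the header or the body? */
};

struct mime_stack {
    struct mime_part part[MAX_DEPTH];
    int depth;             /* 0 = the message itself, >0 = inside a mime part */
};

static void stack_init(struct mime_stack *s)
{
    s->depth = 0;
    s->part[0].is_text = true;      /* MIME default type is text/plain */
    s->part[0].where = IN_HEADER;
}

/* Called when a boundary line opens a new part. */
static void enter_part(struct mime_stack *s)
{
    assert(s->depth + 1 < MAX_DEPTH);
    s->depth++;
    s->part[s->depth].is_text = true;
    s->part[s->depth].where = IN_HEADER;
}

/* Called when the current part ends. */
static void leave_part(struct mime_stack *s)
{
    assert(s->depth > 0);
    s->depth--;
}

/* Record the content type on the CURRENT level, the same one tested below. */
static void set_is_text(struct mime_stack *s, bool is_text)
{
    s->part[s->depth].is_text = is_text;
}

/* Called at the blank line that ends a header. */
static void end_of_header(struct mime_stack *s)
{
    s->part[s->depth].where = IN_BODY;
}

/* Drop tokens only in the body of a non-text mime part, never at depth 0. */
static bool should_ignore_tokens(const struct mime_stack *s)
{
    const struct mime_part *cur = &s->part[s->depth];
    return s->depth > 0 && cur->where == IN_BODY && !cur->is_text;
}
```

With this rule the text between the message header and the first boundary 
is still tokenized, because the stack is still at depth 0 there.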

Command "bogolexer -p -x lm -vvv < tests/t.systest.d/inputs/msg.2.txt" 
gives a good view of what's happening.

David