buffer swapping....Lots of progress.

Tue Feb 25 16:42:20 CET 2003

I *think* this change needs to be made:

    while (msg_header
            && count != -1
            && memcmp(buf,spam_header_name,hdrlen) == 0)
     {

The count != -1 test should change to count >= hdrlen; otherwise memcmp 
will compare leftover buffer contents whenever the line is shorter than 
hdrlen.
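
The length guard can be sketched as a small helper.  This is only an 
illustration, not code from bogofilter: has_prefix is a hypothetical name, 
and buf, count and the prefix length stand in for the variables used in 
the loop above.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not in bogofilter): true only when the line holds
 * at least prefix_len valid bytes AND those bytes match the prefix.
 * count may be -1 at EOF or 0 for an empty read; both fail the guard. */
static bool has_prefix(const char *buf, int count,
                       const char *prefix, size_t prefix_len)
{
    if (count < 0 || (size_t)count < prefix_len)
        return false;   /* too short: never look at stale bytes */
    return memcmp(buf, prefix, prefix_len) == 0;
}
```

With a guard like this, a short line can never match just because an 
earlier, longer line left the header name behind in the buffer.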

In lexer.c, I think count should be checked for >= 5 before the memcmp here.

   if (count > 0
         && memcmp("From ", buf, 5) != 0
This instead of the memcmp line above?
         && ((count >= 5 && memcmp("From ", buf, 5) != 0) || count < 5)
         && !msg_header && !msg_state->mime_header
         && msg_state->mime_type != MIME_TYPE_UNKNOWN) {
         int decoded_count = mime_decode(buf, count);
         /*change buffer size only if the decoding worked */
         if (decoded_count != 0)
             count = decoded_count;
     }
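
The proposed condition may be easier to read when the "From " test is 
pulled out into its own function.  The sketch below covers only the 
line-level part of the test (not the msg_header or MIME checks), and both 
function names are made up for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: is this line an mbox "From " separator?
 * Only the first count bytes of buf are valid. */
static bool is_from_line(const char *buf, int count)
{
    return count >= 5 && memcmp("From ", buf, 5) == 0;
}

/* Line-level part of the proposed test: a non-empty line that is not a
 * "From " separator.  Lines shorter than 5 bytes pass, because they
 * cannot be a separator.  By De Morgan this is the same as
 * count > 0 && ((count >= 5 && memcmp(...) != 0) || count < 5). */
static bool line_may_be_decoded(const char *buf, int count)
{
    return count > 0 && !is_from_line(buf, count);
}
```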

I think I got it working; at least the buffer swaps are working and I am 
not stepping all over the buffers.  Once I knew what was going on, it was 
as simple as delaying the buffer swap until *after* the call to yylex.  I 
moved my swap induction code to get_token, right after the while(true) 
{.  All the basic tests work, but all the advanced tests fail:

make  check-TESTS
make[1]: Entering directory `/home/njs/prod_bogofilter/bogofilter-0.10.3.1/tests/bogofilter'
FAIL: t.lexer.mbx
FAIL: t.robx
SKIP: t.valgrind
./outputs/split.out ./checks.3729.20030225T074011/split.out differ: char 27, line 1
FAIL: t.split
./outputs/msg.1.f.v ./checks.3804.20030225T074012/tests/msg.1.f.v differ: char 53, line 1
FAIL: t.systest
./outputs/grftest.out ./checks.4118.20030225T074014/tests/grftest.out differ: char 81, line 2
FAIL: t.grftest
======================
5 of 5 tests failed
(1 tests were not run)
======================

What I am seeing is oddities like this:

*** 88 b,b 1
*** 89 b,b 51 From glamb at dynagen.on.ca  Fri Nov  1 06:20:26 2002
lexer_state: TEXT -> HEAD
*** mime_reset
*** mime_pop. stackp: 0
*** mime_push. stackp: 0
lexer_state: HEAD -> TEXT
*** 90 b,b 35 Return-Path: <glamb at dynagen.on.ca>
*** 91 b,b 33 Delivered-To: smythe at example.com

So something is triggering an extra state change.  Hmmmm.  OK, I figured 
that out.  It was noticing a newline while it was still processing the line 
in the text processor.  The fix is to note that a state change is pending 
and to refuse another one while it is pending.  That is in the patch 
below.  But...
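
The idea of refusing a second state change while one is pending can be 
sketched in isolation like this.  It is a toy model under my own 
assumptions, not the code from the patch: the enum, the variables and both 
function names are invented for the example.

```c
#include <stdbool.h>

/* Hypothetical states; the real lexer uses its own lexer_state_t. */
typedef enum { DEMO_HEAD, DEMO_TEXT, DEMO_HTML } demo_state_t;

static demo_state_t demo_current = DEMO_HEAD;  /* state in effect now */
static demo_state_t demo_pending = DEMO_HEAD;  /* requested, not yet applied */
static bool demo_change_pending = false;

/* Record a requested change; refuse a second one until the first has
 * been applied.  Returns true if the request was accepted. */
static bool request_state_change(demo_state_t next)
{
    if (demo_change_pending || next == demo_current)
        return false;
    demo_pending = next;
    demo_change_pending = true;
    return true;
}

/* Called at a safe point (for example, between lines) to apply the
 * recorded change. */
static void apply_pending_change(void)
{
    if (demo_change_pending) {
        demo_current = demo_pending;
        demo_change_pending = false;
    }
}
```

In this model a stray newline seen in the middle of a line cannot flip the 
state back, because the earlier request is still waiting to be applied.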

[njs at glock bogofilter]$ ./t.lexer.mbx -v 2>&1 | less
[njs at glock bogofilter]$ pwd
/home/njs/prod_bogofilter/bogofilter-0.10.3.1/tests/bogofilter
[njs at glock bogofilter]$

 >*** 116 b,b 28 Content-Disposition: inline
 >yyredo:  27 "Content-Disposition: inline"
 >*** 117 b,b 66 In-Reply-To:
 ><4.3.2.7.2.20021031175102.00be09a0 at mail.example.com>
 >*** 118 b,b 41 Organization: Dynagen Consulting Limited
 >*** 119 b,b 1
 >*** 120 b,b 56 On 20021031 (Thu) at 1805:18 -0500, David Smythe wrote:
 >*** 121 b,b 1
 >*** 122 b,b 71 > Also, I'm thinking the beginning of the Intro needs a bit
 >more of a

Note that even though we are parsing headers here, the lines are still 
flagged as b,b, so the newline (which is hit) does not switch the state 
from head to body.  It looks like this is supposed to be done in 
mime_reset, and I think that is being called, but the state is wrong later.

I would appreciate it if someone who has worked on this code can look at it.

The debugging lines still read b,b, and then the *next* state change is not 
made because the got_newline() code is ignored.  I have to go see the 
doctor now, so I have to get going.

[njs at glock bogofilter-0.10.3.1]$ for a in *~1~; do b=`basename $a .~1~`; 
diff -u $a $b; done > /home/docroot/docs/njs/demime/bogofilter-0.10.3.1.patch
[njs at glock bogofilter-0.10.3.1]$

http://majordomo.squawk.com/njs/demime/bogofilter-0.10.3.1.patch

The rest is the explanation of the patch:

In "change_lexer_state()", I stash the old and new state on a state change:

static lexer_state_t simicich_old_lexer_state = LEXER_HEAD;
static lexer_state_t simicich_new_lexer_state = LEXER_HEAD;

static
void change_lexer_state(lexer_state_t new)
{
     /* if change of state, show new state */
     if (DEBUG_LEXER(1) && lexer_state != new)
         fprintf(dbgout, "lexer_state: %s -> %s\n",
                 state_name(lexer_state), state_name(new));

     /* start simicich */
     if(lexer_state != new) {    /* No need to swap buffers back to where it */
                                 /* came from - we can shortcut. */
       simicich_old_lexer_state = lexer_state;
       simicich_new_lexer_state = new;
     }
     /* end simicich */
     lexer_state = new;
     return;
}

In get_token:

token_t get_token(void)
{
     token_t class = NONE;
     unsigned char *cp;
     /* simicich */
     void * hold_buffer;
     /* end */

[......]

     while (true) {
       /* start simicich */
       if(simicich_old_lexer_state != simicich_new_lexer_state) {
                                 /* No need to swap buffers back to where it */
                                 /* came from - we can shortcut. */
         switch(simicich_old_lexer_state) {
         case LEXER_HTML:
           hold_buffer = lexer_text_html_extract_current_buffer();
           break;
         case LEXER_TEXT:
           hold_buffer = lexer_text_plain_extract_current_buffer();
           break;
         case LEXER_HEAD:
           hold_buffer = lexer_head_extract_current_buffer();
           break;
         default:
           hold_buffer = NULL;
         }
         switch(simicich_new_lexer_state) {
         case LEXER_HTML:
           lexer_text_html_install_buffer(hold_buffer);
           break;
         case LEXER_HEAD:
           lexer_head_install_buffer(hold_buffer);
           break;
         case LEXER_TEXT:
           lexer_text_plain_install_buffer(hold_buffer);
           break;
         }
         simicich_old_lexer_state = simicich_new_lexer_state;
       }
       /* end simicich */

This was added near the end of token.h to declare the calling sequences 
for the routines.  YY_BUFFER_STATE is only defined in the lexers; 
everywhere else we can treat it as an opaque pointer:

/* Added Simicich - support buffer switching */
void * lexer_head_extract_current_buffer( void );
void * lexer_text_html_extract_current_buffer( void );
void * lexer_text_plain_extract_current_buffer( void );
#ifdef FLEX_SCANNER
void lexer_head_install_buffer(YY_BUFFER_STATE);
void lexer_text_html_install_buffer(YY_BUFFER_STATE);
void lexer_text_plain_install_buffer(YY_BUFFER_STATE);
#else
void lexer_head_install_buffer(void *);
void lexer_text_html_install_buffer(void *);
void lexer_text_plain_install_buffer(void *);
#endif

Finally, at the end of each lexer, I have code like this - this is the code 
out of lexer_head.l:

%%

void * lexer_head_extract_current_buffer(void)
{
   YY_BUFFER_STATE hold = yy_current_buffer;
   yy_switch_to_buffer(yy_create_buffer( yyin, YY_BUF_SIZE ));
   /* yy_delete_buffer(yy_current_buffer); */
   return hold;
}
void lexer_head_install_buffer(YY_BUFFER_STATE b)
{
   YY_BUFFER_STATE hold = yy_current_buffer;
   if(b != hold) {
     yy_delete_buffer(hold);
     yy_switch_to_buffer(b);
   }
   yy_init = 1;
}

/*
  * The following sets edit modes for GNU EMACS
  * Local Variables:
  * mode:c
  * End:
  */
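
The ownership hand-off that the extract and install routines perform can 
be modelled without flex at all.  The sketch below uses a mock buffer type 
and mock lexers that I made up for the purpose; it does not call any flex 
function and says nothing about yy_init or the scanner's internal state.

```c
#include <stdlib.h>

/* Mock stand-in for a flex buffer; NOT the real YY_BUFFER_STATE. */
typedef struct mock_buffer { int id; } mock_buffer;

static int next_id = 1;        /* gives each mock buffer a distinct id */
static int live_buffers = 0;   /* allocated and not yet freed */

static mock_buffer *mock_create(void)
{
    mock_buffer *b = malloc(sizeof *b);
    if (b == NULL)
        abort();
    b->id = next_id++;
    live_buffers++;
    return b;
}

static void mock_delete(mock_buffer *b)
{
    if (b != NULL) {
        free(b);
        live_buffers--;
    }
}

/* One "current buffer" slot per mock lexer. */
typedef struct mock_lexer { mock_buffer *current; } mock_lexer;

/* Like the extract routine: hand the current buffer to the caller and
 * leave a fresh one in its place. */
static mock_buffer *mock_extract(mock_lexer *lx)
{
    mock_buffer *hold = lx->current;
    lx->current = mock_create();
    return hold;
}

/* Like the install routine: discard the lexer's own buffer and adopt b. */
static void mock_install(mock_lexer *lx, mock_buffer *b)
{
    if (b != lx->current) {
        mock_delete(lx->current);
        lx->current = b;
    }
}
```

Calling mock_install(&text, mock_extract(&head)) moves the head lexer's 
buffer to the text lexer.  Afterwards each lexer owns exactly one buffer 
and the text lexer's old one has been freed, which is the invariant the 
real routines are meant to keep.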

--
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!