## **Code Scheduling**

Kostis Sagonas, Spring 2006

## Outline

- Modern architectures
- Delay slots
- Introduction to instruction scheduling
- List scheduling
- Resource constraints
- Interaction with register allocation
- Scheduling across basic blocks
- Trace scheduling
- Scheduling for loops
- Loop unrolling
- Software pipelining

## Simple Machine Model

- Instructions are executed in sequence
  - Fetch, decode, execute, store results
  - One instruction at a time
- For branch instructions, start fetching from a different location if needed
  - Check branch condition
  - Next instruction may come from a new location given by the branch instruction

## Simple Execution Model

5-stage pipeline

| fetch | decode | execute | memory | write back |
|-------|--------|---------|--------|------------|

- Fetch: get the next instruction
- Decode: figure out what that instruction is
- Execute: perform the ALU operation, or the address calculation in a memory operation
- Memory: do the memory access in a memory operation
- Write Back: write the results back

## Execution Models

[Figure: timing diagrams of the stages IF, DE, EXE, MEM, WB over time. Model 1 shows Inst 1 and Inst 2; Model 2 shows Inst 1 through Inst 5.]

## Handling Branch Instructions

Problem: We do not know the location of the next instruction until later

- after DE in jump instructions
- after EXE in conditional branch instructions

What to do with the middle 2 instructions?

1. Stall the pipeline in case of a branch until we know the address of the next instruction
   - wasted cycles
2. Delay the action of the branch
   - Make the branch take effect only after two instructions
   - The two instructions following the branch get executed regardless of the branch

## Branch Delay Slot(s)

MIPS has a branch delay slot

- The instruction after a conditional branch gets executed even if the code branches to the target
- Fetching from the branch target takes place only after that

Layout:

    ble r3, foo
    <branch delay slot>

What instruction to put in the branch delay slot?

## Filling the Branch Delay Slot

Simple solution: put a no-op

- Wasted instruction, just like a stall

Move an instruction from above the branch

- moved instruction executes iff branch executes
  - So, get the instruction from the same basic block as the branch
  - don't move a branch instruction!
- instruction needs to be moved over the branch
  - branch does not depend on the result of the instruction

Move an instruction dominated by the branch instruction

Move an instruction from the branch target

- Instruction dominated by target
- No other ways to reach target (if so, take care of them)
- If conditional branch, the moved instruction should not have a lasting effect if the branch is not taken

## Load Delay Slots

Problem: Results of the loads are not available until the end of the MEM stage

If the value of the load is used... what to do?

- Always stall one cycle
- Stall one cycle if the next instruction uses the value
  - Need hardware to do this
- Have a delay slot for load
  - The new value is only available after two instructions
  - If the next instruction uses the register, it will get the old value

## Example

Assume 1 cycle delay on branches and 1 cycle latency for loads.

Original code:

    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r4 = r2 + r3
    r5 = r2 - 1
    goto L1

Intermediate step, with `r5 = r2 - 1` placed after the second load:

    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r5 = r2 - 1

Final code after delay slot filling:

    r2 = *(r1 + 4)
    r3 = *(r1 + 8)
    r5 = r2 - 1
    goto L1
    r4 = r2 + r3

## From a Simple Machine Model to a Real Machine Model

- Many pipeline stages
  - MIPS R4000 has 8 stages
- Different instructions take different amounts of time to execute
  - mult: 10 cycles
  - div: 69 cycles
  - ddiv: 133 cycles
- Hardware to stall the pipeline if an instruction uses a result that is not ready

## Real Machine Model cont.

- Most modern processors have multiple execution units (superscalar)
  - If the instruction sequence is correct, multiple operations will take place in the same cycle
  - Even more important to have the right instruction sequence

## **Instruction Scheduling**

Goal: Reorder instructions so that pipeline stalls are minimized

Constraints on instruction scheduling:

- Data dependencies
- Control dependencies
- Resource constraints

## Data Dependencies

- If two instructions access the same variable, they can be dependent
- Kinds of dependencies
  - True: write $\rightarrow$ read
  - Anti: read $\rightarrow$ write
  - Output: write $\rightarrow$ write
- What to do if two instructions are dependent?
  - The order of execution cannot be reversed
  - Reduces the possibilities for scheduling

## Computing Data Dependencies

- For basic blocks, compute dependencies by walking through the instructions
- Identifying register dependencies is simple
  - is it the same register?
- For memory accesses
  - simple: base + offset1 ?= base + offset2
  - data dependence analysis: a[2i] ?= a[2i+1]
  - interprocedural analysis: global ?= parameter
  - pointer alias analysis: p1 ?= p

## Representing Dependencies

- Using a dependence DAG, one per basic block
- Nodes are instructions, edges represent dependencies

Each edge is labeled with a latency: $v(i \rightarrow j)$ = delay required between the initiation times of $i$ and $j$ minus the execution time required by $i$

## Example

Basic block:

    1: r2 = *(r1 + 4)
    2: r3 = *(r2 + 4)
    3: r4 = r2 + r3
    4: r5 = r2 - 1

## Another Example

Basic block:

    1: r2 = *(r1 + 4)
    2: *(r1 + 4) = r3
    3: r3 = r2 + r3
    4: r5 = r2 - 1

## Control Dependencies and Resource Constraints

- For now, let's worry only about basic blocks
- For now, let's look at simple pipelines

## Example

```
                          Results available in
1: LA    r1,array         1 cycle
2: LD    r2,4(r1)         1 cycle
3: AND   r3,r3,0x00FF     1 cycle
4: MULC  r6,r6,100        3 cycles
5: ST    r7,4(r6)
6: DIVC  r5,r5,100        4 cycles
7: ADD   r4,r2,r5         1 cycle
8: MUL   r5,r2,r4         3 cycles
9: ST    r4,0(r1)
```

[Figure: cycle-by-cycle execution of this sequence, with stalled cycles marked `st`.]

## List Scheduling Algorithm

Idea:

- Do a topological sort of the dependence DAG
- Consider when an instruction can be scheduled without causing a stall
- Schedule the instruction if it causes no stall and all its predecessors are already scheduled
- Optimal list scheduling is NP-complete
  - Use heuristics when necessary

Steps:

- Create a dependence DAG of a basic block
- Topological sort

Topological sort in pseudocode:

    READY = nodes with no predecessors
    Loop until READY is empty
        Schedule each node in READY when no stalling
        READY += nodes whose predecessors have all been scheduled

## Heuristics for selection

Heuristics for selecting from the READY list

1. pick the node with the longest path to a leaf in the dependence graph
2. pick a node with the most immediate successors
3. pick a node that can go to a less busy pipeline (in a superscalar implementation)

Pick the node with the longest path to a leaf in the dependence graph. Algorithm (for node $x$):

- If $x$ has no successors, $d_x = 0$
- Otherwise, $d_x = \max(d_y + c_{xy})$ over all successors $y$ of $x$

Use reverse breadth-first visiting order.

Pick a node with the most immediate successors. Algorithm (for node $x$):

- $f_x$ = number of successors of $x$

## **Resource Constraints**

- Modern machines have many resource constraints
- Superscalar architectures:
  - can run a few operations in parallel
  - but have constraints

## Resource Constraints of a Superscalar Processor

Example:

- 1 integer operation
  - `ALUop dest, src1, src2` in 1 clock cycle

In parallel with

- 1 memory operation
  - `LD dst, addr` in 2 clock cycles
  - `ST src, addr` in 1 clock cycle

## List Scheduling Algorithm with Resource Constraints

- Represent the superscalar architecture as multiple pipelines
  - Each pipeline represents some resource
- Example
  - One single-cycle ALU unit
  - One two-cycle pipelined memory unit

[Figure: empty pipeline table with rows ALUop, MEM 1, MEM 2.]

Steps:

- Create a dependence DAG of a basic block
- Topological sort

Topological sort in pseudocode:

    READY = nodes with no predecessors
    Loop until READY is empty
        Let n ∈ READY be the node with the highest priority
        Schedule n in the earliest slot that satisfies precedence + resource constraints
        Update READY

## Register Allocation and Instruction Scheduling

- If register allocation is performed before instruction scheduling
  - the choices for scheduling are restricted
- If instruction scheduling is performed before register allocation
  - register allocation may spill registers
  - spilling will change the carefully done schedule!

## Scheduling across basic blocks

- The number of instructions in a basic block is small
  - Cannot keep multiple units with long pipelines busy by just scheduling within a basic block
- Need to handle control dependencies
  - Scheduling constraints across basic blocks
  - Scheduling policy

## Moving across basic blocks

Downward to an adjacent basic block

- Is there a path to **B** that does not execute **A**?

Upward to an adjacent basic block

- Is there a path from **C** that does not reach **A**?
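The list scheduling algorithm with resource constraints described earlier can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `list_schedule`, the unit names `"MEM"` and `"ALU"`, the latencies (2 cycles for loads, 1 for ALU operations) and the dependence edges (derived here from the register uses of the four-instruction example block) are all assumptions. The priority is the longest-path heuristic $d_x$, with the edge cost $c_{xy}$ taken to be the latency of $x$.

```python
def list_schedule(instrs, edges):
    """instrs: name -> (unit, latency); edges: (pred, succ) dependence pairs.
    Returns a dict name -> issue cycle."""
    succs = {n: [] for n in instrs}
    preds = {n: [] for n in instrs}
    for p, s in edges:
        succs[p].append(s)
        preds[s].append(p)

    priority = {}

    def longest_path(x):
        # d_x = 0 for a leaf, otherwise max over successors y of d_y + c_xy,
        # where c_xy is assumed to be the latency of x.
        if x not in priority:
            priority[x] = max(
                (longest_path(y) + instrs[x][1] for y in succs[x]), default=0
            )
        return priority[x]

    issue = {}    # name -> cycle in which the instruction is issued
    busy = set()  # (unit, cycle) issue slots already taken
    ready = [n for n in instrs if not preds[n]]
    while ready:
        n = max(ready, key=longest_path)  # highest-priority ready node
        ready.remove(n)
        unit = instrs[n][0]
        # Precedence: wait until every predecessor's result is available.
        cycle = max((issue[p] + instrs[p][1] for p in preds[n]), default=0)
        # Resource: each unit is pipelined and accepts one operation per cycle.
        while (unit, cycle) in busy:
            cycle += 1
        busy.add((unit, cycle))
        issue[n] = cycle
        # Update READY with successors whose predecessors are all scheduled.
        for s in succs[n]:
            if s not in ready and all(p in issue for p in preds[s]):
                ready.append(s)
    return issue


# Hypothetical encoding of the four-instruction example block.
block = {
    "1": ("MEM", 2),  # r2 = *(r1 + 4)
    "2": ("MEM", 2),  # r3 = *(r2 + 4)
    "3": ("ALU", 1),  # r4 = r2 + r3
    "4": ("ALU", 1),  # r5 = r2 - 1
}
deps = [("1", "2"), ("1", "3"), ("2", "3"), ("1", "4")]
schedule = list_schedule(block, deps)
print(schedule)
```

Under these assumed latencies, instruction 4 is issued in the same cycle as the second load, which is the kind of overlap the scheduler is meant to find.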
## Trace Scheduling

- Find the most common trace of basic blocks
  - Use profile information
- Combine the basic blocks in the trace and schedule them as one block
- Create compensating (clean-up) code if the execution goes off-trace

## Scheduling for Loops

- Loop bodies are typically small
- But a lot of time is spent in loops due to their iterative nature
- Need better ways to schedule loops

## Loop Example

Machine:

- One load/store unit
  - load: 2 cycles
  - store: 2 cycles
- Two arithmetic units
  - add: 2 cycles
  - branch: 2 cycles (no delay slot)
  - multiply: 3 cycles
- Both units are pipelined (initiate one op each cycle)

Source code:

```
for i = 1 to N
  A[i] = A[i] * b
```

Assembly code:

```
loop: ld  r6, (r2)
      mul r6, r6, r3
      st  r6, (r2)
      add r2, r2, 4
      ble r2, r5, loop
```

Schedule (9 cycles per iteration)

[Figure: schedule table for the loop body; the cell layout is not recoverable.]

## Loop Unrolling

Oldest compiler trick of the trade: unroll the loop body a few times

Pros:

- Creates a much larger basic block for the body
- Eliminates a few loop-bound checks

Cons:

- Much larger program
- Setup code (# of iterations < unroll factor)
- Beginning and end of the schedule can still have unused slots

## Loop Example

The loop unrolled twice:

```
loop: ld  r6, (r2)
      mul r6, r6, r3
      st  r6, (r2)
      add r2, r2, 4
      ld  r6, (r2)
      mul r6, r6, r3
      st  r6, (r2)
      add r2, r2, 4
      ble r2, r5, loop
```

Schedule (8 cycles per iteration)

## Loop Unrolling

- Rename registers
  - Use different registers in different iterations

After renaming `r6` to `r7` in the second copy of the body:

```
loop: ld  r6, (r2)
      mul r6, r6, r3
      st  r6, (r2)
      add r2, r2, 4
      ld  r7, (r2)
      mul r7, r7, r3
      st  r7, (r2)
      add r2, r2, 4
      ble r2, r5, loop
```

- Eliminate unnecessary dependencies
  - again, use more registers to eliminate true, anti and output dependencies
  - eliminate dependent chains of calculations when possible

After using `r1` and `r2` alternately for the address:

```
loop: ld  r6, (r1)
      mul r6, r6, r3
      st  r6, (r1)
      add r2, r1, 4
      ld  r7, (r2)
      mul r7, r7, r3
      st  r7, (r2)
      add r1, r2, 4
      ble r1, r5, loop
```

After breaking the chain of dependent additions (`add r1, r2, 4` becomes `add r1, r1, 8`):

```
loop: ld  r6, (r1)
      mul r6, r6, r3
      st  r6, (r1)
      add r2, r1, 4
      ld  r7, (r2)
      mul r7, r7, r3
      st  r7, (r2)
      add r1, r1, 8
      ble r1, r5, loop
```

Schedule (4.5 cycles per iteration)

## **Software Pipelining**

- Try to overlap multiple iterations so that the slots will be filled
- Find the steady-state window so that:
  - all the instructions of the loop body are executed
  - but from different iterations

## Loop Example

Assembly code:

```
loop: ld  r6, (r2)
      mul r6, r6, r3
      st  r6, (r2)
      add r2, r2, 4
      ble r2, r5, loop
```

Schedule (2 cycles per iteration)

[Figure: software-pipelined schedule in which operations from several iterations (labeled ld, ld3, mul1, mul2, st, st1, add, add1, ble) share the steady-state window; the cell layout is not recoverable.]

- 4 iterations are overlapped
  - values of r3 and r5 don't change
  - 4 regs for &A[i] (r2)
  - each address is incremented by 4*4
  - 4 regs to keep the value A[i] (r6)
- The same registers can be reused after 4 of these blocks: generate code for 4 blocks, otherwise need to move

## Software Pipelining

- Optimal use of resources
- Need a lot of registers
  - Values in multiple iterations need to be kept
- Issues in dependencies
  - Executing a store instruction in an iteration before the branch instruction is executed for a previous iteration (writing when it should not have)
  - Loads and stores are issued out of order (need to figure out dependencies before doing this)
- Code generation issues
  - Generate pre-amble and post-amble code
  - Multiple blocks so no register copy is needed
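The idea of a steady-state window surrounded by pre-amble and post-amble code can be illustrated at the source level. The following is a minimal Python sketch, not generated machine code: the function names `scale_simple` and `scale_pipelined` and the three-stage split (load, multiply, store) are assumptions made for illustration, based on the loop `A[i] = A[i] * b` from the loop example. In the kernel, each pass performs a store, a multiply and a load that belong to three different iterations of the original loop.

```python
def scale_simple(a, b):
    """Reference loop: A[i] = A[i] * b, one iteration at a time."""
    out = list(a)
    for i in range(len(out)):
        out[i] = out[i] * b
    return out


def scale_pipelined(a, b):
    """Same computation with three overlapped stages: load, multiply, store."""
    out = list(a)
    n = len(out)
    if n < 2:
        # Too few iterations to fill the pipeline: fall back to the simple loop.
        return scale_simple(a, b)

    # Pre-amble: start the first two iterations.
    loaded = out[0]           # load for iteration 0
    product = loaded * b      # multiply for iteration 0
    loaded = out[1]           # load for iteration 1

    # Kernel (steady state): one operation from each of three iterations.
    for i in range(n - 2):
        out[i] = product          # store for iteration i
        product = loaded * b      # multiply for iteration i + 1
        loaded = out[i + 2]       # load for iteration i + 2

    # Post-amble: drain the last two iterations.
    out[n - 2] = product      # store for iteration n - 2
    product = loaded * b      # multiply for iteration n - 1
    out[n - 1] = product      # store for iteration n - 1
    return out


values = [1, 2, 3, 4, 5]
print(scale_pipelined(values, 3) == scale_simple(values, 3))
```

The fallback for very short inputs plays the role of the setup code mentioned for loop unrolling: the pipelined version is only valid when there are enough iterations to fill the pipeline.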