# Lecture 10: Memory Dependence Detection and Speculation

Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha 21264 example

## Register and Memory Dependences

Store: SW Rt, A(Rs)

1. Calculate effective memory address ⇒ dependent on Rs
2. Write to D-Cache ⇒ dependent on Rt, and cannot be speculative

Compare with "ADD Rd, Rs, Rt": what is the difference?

Load: LW Rt, A(Rs)

1. Calculate effective memory address ⇒ dependent on Rs
2. Read D-Cache ⇒ could be memory-dependent on pending writes!

When is the memory dependence known?

## Memory Correctness and Performance

Correctness conditions:

- Only committed store instructions can write to memory
- Any load instruction receives its memory operand from its parent (the latest earlier store instruction to the same address)
- At the end of execution, any memory word holds the value of the last write to it

Performance: exploit memory-level parallelism

## Load/Store Buffer in Tomasulo

- Original Tomasulo: load/store addresses are precalculated before scheduling
- Loads are not dependent on other instructions
- Stores are dependent on the instructions producing the store data
- Provide dynamic memory disambiguation: check the memory dependence between stores and loads

## Dynamic Scheduling with Integer Instructions

Centralized design example:

- Centralized reservation stations usually include the load buffer
- Integer units are shared by load/store and ALU instructions
- What is the challenge in detecting memory dependence?
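
One way to see why the question "when is the memory dependence known?" is hard: a load can only be cleared against an older store once that store's effective address has been calculated. The following is a minimal sketch, not the design of any particular processor. The names `PendingStore`, `DepStatus`, and `classify_load` are hypothetical, and it assumes word-aligned accesses of equal size so that address equality is the whole overlap test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for a store that is older than the load in
 * program order and has not yet written the D-cache. */
typedef struct {
    bool     addr_ready;  /* step 1 done: effective address computed */
    uint32_t addr;        /* valid only when addr_ready is true */
} PendingStore;

typedef enum { DEP_NONE, DEP_FOUND, DEP_UNKNOWN } DepStatus;

/* Classify a load against all older pending stores.
 * Assumption: equal-size, word-aligned accesses, so two accesses
 * conflict exactly when their addresses are equal. */
DepStatus classify_load(uint32_t load_addr,
                        const PendingStore *stores, size_t n)
{
    assert(stores != NULL || n == 0);
    bool unknown = false;
    for (size_t i = 0; i < n; i++) {
        if (!stores[i].addr_ready)
            unknown = true;          /* cannot rule this store out yet */
        else if (stores[i].addr == load_addr)
            return DEP_FOUND;        /* depends on at least one pending write */
    }
    /* No known match: independent only if every address was known. */
    return unknown ? DEP_UNKNOWN : DEP_NONE;
}
```

With one pending store whose address is still being computed, every load is classified `DEP_UNKNOWN` regardless of its own address. Unlike a register dependence, which is visible at decode, the memory dependence is resolved only after address calculation.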
## Load/Store with Dynamic Execution

- Only committed store instructions can write to memory
  ⇒ Use the store buffer as a temporary place for write instruction output
- Any memory word receives the value of the last write
  ⇒ Store instructions write to memory in program order
- Memory-level parallelism should be exploited
  ⇒ Non-speculative solution: load bypassing and load forwarding
  ⇒ Speculative solution: speculative load execution

## Store Buffer Design Example

Store instruction:

1. Wait in RS until the base address and data are ready
2. Calculate address, move to store buffer
3. Move data directly to store buffer
4. Wait for commit

If no exception/mis-predict:

5. Wait for memory port
6. Write to D-cache

(Figure: store path from the RS through the store buffer to the D-cache and architectural state.)

## Memory Dependence

Any load instruction receives its memory operand from its parent (a store instruction)

- If any previous store has not written the D-cache, what to do?
- If any previous store has not finished, what to do?

Simple design: delay all following loads; but how about performance?
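
The store buffer steps above can be sketched as a small FIFO. This is an illustrative model only: `StoreBuffer`, `SB_SIZE`, `MEM_WORDS`, and the `sb_*` functions are hypothetical names, the `mem` array is a toy word-addressed stand-in for the D-cache, and commit is assumed to happen strictly in program order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SB_SIZE   8    /* hypothetical store buffer capacity */
#define MEM_WORDS 64   /* toy word-addressed memory standing in for the D-cache */

typedef struct {
    uint32_t addr, data;
    bool     committed;
} SBEntry;

typedef struct {
    SBEntry  e[SB_SIZE];
    int      head, count;      /* circular FIFO; head is the oldest store */
    uint32_t mem[MEM_WORDS];
} StoreBuffer;

/* Steps 2-3: address and data enter the buffer in program order. */
bool sb_insert(StoreBuffer *sb, uint32_t addr, uint32_t data)
{
    assert(addr < MEM_WORDS);
    if (sb->count == SB_SIZE)
        return false;                          /* buffer full: stall */
    SBEntry *s = &sb->e[(sb->head + sb->count) % SB_SIZE];
    s->addr = addr;
    s->data = data;
    s->committed = false;
    sb->count++;
    return true;
}

/* Step 4: the oldest store that has not yet committed commits. */
void sb_commit_oldest(StoreBuffer *sb)
{
    for (int i = 0; i < sb->count; i++) {
        SBEntry *s = &sb->e[(sb->head + i) % SB_SIZE];
        if (!s->committed) { s->committed = true; return; }
    }
}

/* Steps 5-6: with a free memory port, write the oldest entry, but
 * only if it has committed. Returns true if a write happened. */
bool sb_drain_one(StoreBuffer *sb)
{
    if (sb->count == 0 || !sb->e[sb->head].committed)
        return false;
    SBEntry *s = &sb->e[sb->head];
    sb->mem[s->addr] = s->data;                /* writes occur in program order */
    sb->head = (sb->head + 1) % SB_SIZE;
    sb->count--;
    return true;
}

/* Exception or misprediction: drop every uncommitted store.
 * Committed entries form a prefix because commit is in order. */
void sb_flush_uncommitted(StoreBuffer *sb)
{
    int kept = 0;
    while (kept < sb->count && sb->e[(sb->head + kept) % SB_SIZE].committed)
        kept++;
    sb->count = kept;
}
```

Because entries leave only from the head and only after commit, two stores to the same address reach memory in program order, and a flushed store never touches memory. A load that must wait for this drain is the "simple design" whose performance cost the slide questions.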
## Memory-level Parallelism

    for (i=0;i<100;i++)
        A[i] = A[i]*2;

    Loop: L.S  F2, 0(R1)
          MULT F2, F2, F4
          SW   F2, 0(R1)
          ADD  R1, R1, 4
          BNE  R1, R3, Loop

F4 stores 2.0.

Note: a store that does not commit is flushed before writing the D-cache.

Significant improvement over sequential reads/writes

## Load Bypassing and Load Forwarding

Non-speculative solution

- Dynamic disambiguation: match the load address with all store addresses
- Load bypassing: start the cache read if no match is found
- Load forwarding: use the store buffer value if a match is found
- In-order execution limitation: must wait until all previous stores have finished

(Figure: RS, store unit, I-FU, address match logic, and D-cache.)

## In-order Execution Limitation

Example 1:

    for (i=0;i<100;i++)
        A[i] = A[i]/2;

    Loop: L.S  F2, 0(R1)
          DIV  F2, F2, F4
          SW   F2, 0(R1)
          ADD  R1, R1, 4
          BNE  R1, R3, Loop

Example 2:

    a->b->c = 100;
    d = x;

Example 1: When is the SW result available, and when can the next load start? Possible solution: start store address calculation early ⇒ more complex design

Example 2: When is the address "a->b->c" available?

## Speculative Load Execution

If no dependence is predicted:

- Send loads out even if the dependence is unknown
- Do address matching at store commit
- Match found: memory dependence violation, flush pipeline
- Otherwise: continue

Note: may still need load forwarding (not shown)

## Summary of Superscalar Execution

- Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
- Register data flow techniques: register renaming, instruction scheduling, in-order commit, mis-prediction recovery
- Memory data flow techniques: load/store units, memory consistency

Source: Shen & Lipasti reference book
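
The two memory data flow policies covered in this lecture, non-speculative load bypassing/forwarding and speculative load execution, can be contrasted in a short sketch. It is a simplified model under stated assumptions, not the mechanism of a specific machine. `OlderStore`, `YoungerLoad`, `resolve_load`, and `store_commit_finds_violation` are hypothetical names. Accesses are assumed word-aligned and of equal size, and every speculatively executed load is assumed to have read the D-cache rather than a forwarded value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: stores older than the load, listed oldest first. */
typedef struct {
    bool     addr_ready;
    uint32_t addr;
    uint32_t data;
} OlderStore;

typedef enum { LOAD_WAIT, LOAD_BYPASS, LOAD_FORWARD } LoadAction;

/* Non-speculative policy: decide only once every older store
 * address is known (the in-order execution limitation). */
LoadAction resolve_load(uint32_t load_addr, const OlderStore *st, size_t n,
                        uint32_t *forwarded)
{
    assert(forwarded != NULL);
    bool     match = false;
    uint32_t value = 0;
    for (size_t i = 0; i < n; i++) {
        if (!st[i].addr_ready)
            return LOAD_WAIT;        /* an older store address is unknown */
        if (st[i].addr == load_addr) {
            match = true;
            value = st[i].data;      /* later matches overwrite: youngest wins */
        }
    }
    if (match) {
        *forwarded = value;          /* load forwarding from the store buffer */
        return LOAD_FORWARD;
    }
    return LOAD_BYPASS;              /* load bypassing: read the D-cache now */
}

/* Hypothetical: loads younger than the committing store. */
typedef struct {
    uint32_t addr;
    bool     executed;  /* already read the D-cache speculatively */
} YoungerLoad;

/* Speculative policy: at store commit, match the store address against
 * younger loads that already executed. A match means the load read a
 * stale value, so the pipeline must be flushed. */
bool store_commit_finds_violation(uint32_t store_addr,
                                  const YoungerLoad *ld, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ld[i].executed && ld[i].addr == store_addr)
            return true;
    return false;
}
```

In the loop of Example 1, `resolve_load` would return `LOAD_WAIT` for the next iteration's load until the SW address is known, whereas the speculative policy lets the load go first and relies on `store_commit_finds_violation` to catch the rare conflict.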