

#### Reading

Data prefetch mechanisms, Steven P. Vanderwiel, David J. Lilja, ACM Computing Surveys, Vol. 32, Issue 2 (June 2000)

## Prefetching

- Predict future cache misses
- Issue a fetch to memory system in advance of the actual memory reference
- Hide memory access latency





# Prefetching Approaches

- Software-based
  - Explicit "fetch" instructions
  - Additional instructions executed
- Hardware-based
  - Special hardware
  - Unnecessary prefetches (without compile-time information)



### Software Data Prefetching

- "fetch" instruction
  - Non-blocking memory operation
  - Cannot cause exceptions (e.g. page faults)
- Modest hardware complexity
- Challenge: prefetch scheduling
  - Placement of the fetch instruction relative to the matching load or store instruction
  - Hand-coded by programmer or automated by compiler
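
On GCC and Clang, the closest analogue to the `fetch` instruction described above is the `__builtin_prefetch` builtin: a hint that does not block and does not fault on an invalid address. The sketch below wraps it in a hypothetical `fetch` helper (the name is borrowed from these slides, it is not a standard function) and assumes a GCC-compatible compiler.

```c
#include <stddef.h>

/* Hypothetical wrapper mirroring the slides' "fetch" instruction.
 * __builtin_prefetch is a GCC/Clang builtin: a non-blocking hint that
 * does not raise an exception, even if the address is not mapped. */
static inline void fetch(const void *addr)
{
    __builtin_prefetch(addr, 0 /* prefetch for read */, 3 /* high temporal locality */);
}

/* Sum an array, hinting one element ahead on every iteration. */
double sum_with_fetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        fetch(&a[i + 1]);   /* at most one past the end; only a hint */
        sum += a[i];
    }
    return sum;
}
```

The hint never changes the computed result; whether it helps depends on where it is placed relative to the load, which is the scheduling challenge noted above.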

#### Loop-based Prefetching

- Loops of large array calculations
  - Common in scientific codes
  - Poor cache utilization
  - Predictable array referencing patterns
- `fetch` instructions can be placed inside loop bodies so that the current iteration prefetches data for a future iteration

#### Example: Vector Product

No prefetching:

```c
for (i = 0; i < N; i++) {
    sum += a[i]*b[i];
}
```

Assume each cache block holds 4 elements $\rightarrow$ 2 misses per 4 iterations (one for `a`, one for `b`)

Simple prefetching:

```c
for (i = 0; i < N; i++) {
    fetch(&a[i+1]);
    fetch(&b[i+1]);
    sum += a[i]*b[i];
}
```

Problem: unnecessary prefetch operations, e.g. a[1], a[2], a[3] (they lie in the same block as a[0])
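
One way to remove the redundant prefetches is to unroll the loop by the number of elements per cache block, so that each block is prefetched only once. The following is a minimal sketch under the same assumption of 4 elements per block; `dot_unrolled` is a hypothetical name, and `__builtin_prefetch` (GCC/Clang) stands in for `fetch`.

```c
#include <stddef.h>

/* Assumption from the slide: one cache block holds 4 array elements. */
#define ELEMS_PER_BLOCK 4

double dot_unrolled(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    /* Main loop: one prefetch per array per block, targeting the next block. */
    for (; i + ELEMS_PER_BLOCK <= n; i += ELEMS_PER_BLOCK) {
        __builtin_prefetch(&a[i + ELEMS_PER_BLOCK]);
        __builtin_prefetch(&b[i + ELEMS_PER_BLOCK]);
        sum += a[i]     * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
    }

    /* Remainder loop for n not divisible by the block size. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

The first block of each array is still fetched on demand, and the prefetch issued in the last full iteration is wasted; both costs are small for large arrays.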




#### Example: Vector Product (Cont.)

- Previous assumption: prefetching 1 iteration ahead is sufficient to hide the memory latency
- When loops contain small computational bodies, it may be necessary to initiate prefetches $\delta$ iterations before the data is referenced

$$\delta = \left\lceil \frac{l}{s} \right\rceil$$

- $\delta$: prefetch distance
- $l$: average memory latency
- $s$: estimated cycle time of the shortest possible execution path through one loop iteration

```c
/* Prologue: prefetch sum and the first 12 elements (3 blocks) of each array. */
fetch(&sum);
for (i = 0; i < 12; i += 4) {
    fetch(&a[i]);
    fetch(&b[i]);
}

/* Main loop: unrolled by 4, prefetching 12 elements ahead. */
for (i = 0; i < N-12; i += 4) {
    fetch(&a[i+12]);
    fetch(&b[i+12]);
    sum = sum + a[i]*b[i];
    sum = sum + a[i+1]*b[i+1];
    sum = sum + a[i+2]*b[i+2];
    sum = sum + a[i+3]*b[i+3];
}

/* Epilogue: remaining elements, already prefetched.
   Continues from where the main loop stopped (i = N-12 when N is a multiple of 4). */
for (; i < N; i++)
    sum = sum + a[i]*b[i];
```
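
As a worked instance of the prefetch-distance formula, the helper below computes $\delta$ with integer ceiling division. The function name and the sample numbers (100-cycle latency, 45-cycle iteration) are illustrative assumptions, not measurements.

```c
/* Prefetch distance in loop iterations: ceil(latency / iteration_cycles).
 * Both arguments are in processor cycles; iteration_cycles must be > 0. */
unsigned prefetch_distance(unsigned latency, unsigned iteration_cycles)
{
    return (latency + iteration_cycles - 1) / iteration_cycles;
}
```

With these assumed numbers, `prefetch_distance(100, 45)` is 3. If each iteration of the unrolled loop consumes 4 elements, prefetching 3 iterations ahead means a lookahead of 12 elements.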

#### Limitation of Software-based Prefetching

- Normally restricted to loops with array accesses
- Hard for general applications with irregular access patterns
- Processor execution overhead
- Significant code expansion
- Performed statically



# Sequential Prefetching

- Take advantage of spatial locality
- One block lookahead (OBL) approach
  - Initiate a prefetch for block *b*+1 when block *b* is accessed
  - Prefetch-on-miss
    - Prefetch whenever an access for block *b* results in a cache miss
  - Tagged prefetch
    - Associates a tag bit with every memory block
    - Prefetch when a block is demand-fetched or a prefetched block is referenced for the first time
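
The difference between the two OBL policies shows up clearly on a purely sequential sweep. The toy simulation below is a sketch under simplifying assumptions that are not from the slides: the cache has unlimited capacity, prefetches complete instantly, and all names (`obl_cache`, `count_misses`, `NUM_BLOCKS`) are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_BLOCKS 64   /* hypothetical memory size, in blocks */

enum obl_policy { PREFETCH_ON_MISS, TAGGED_PREFETCH };

/* Toy cache with unlimited capacity: presence flag plus tag bit per block. */
struct obl_cache {
    bool present[NUM_BLOCKS];
    bool tag[NUM_BLOCKS];   /* set while a prefetched block is still unreferenced */
};

static void prefetch(struct obl_cache *c, int b)
{
    if (b < NUM_BLOCKS && !c->present[b]) {
        c->present[b] = true;
        c->tag[b] = true;
    }
}

/* Access block b (0 <= b < NUM_BLOCKS); returns true on a miss. */
static bool access_block(struct obl_cache *c, int b, enum obl_policy p)
{
    if (!c->present[b]) {             /* demand fetch */
        c->present[b] = true;
        c->tag[b] = false;
        prefetch(c, b + 1);           /* both policies prefetch on a miss */
        return true;
    }
    if (p == TAGGED_PREFETCH && c->tag[b]) {
        c->tag[b] = false;            /* first reference to a prefetched block */
        prefetch(c, b + 1);
    }
    return false;
}

/* Count misses for a sequential sweep over blocks 0..n-1 (n <= NUM_BLOCKS). */
int count_misses(enum obl_policy p, int n)
{
    struct obl_cache c;
    int misses = 0;
    memset(&c, 0, sizeof c);
    for (int b = 0; b < n; b++)
        misses += access_block(&c, b, p) ? 1 : 0;
    return misses;
}
```

Under these assumptions, a 16-block sweep misses on every other block with prefetch-on-miss (8 misses) but only on the first block with tagged prefetch (1 miss), because each first use of a prefetched block triggers the next prefetch.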

# OBL Approaches

[Figure: prefetch-on-miss, with blocks marked as demand-fetched or prefetched]



#### Stream Buffer

- K prefetched blocks → FIFO stream buffer
- As each buffer entry is referenced
  - Move it to cache
  - Prefetch a new block to stream buffer
- Avoid cache pollution
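
The stream buffer behavior listed above can be sketched as a small simulation. This is an illustration under stated assumptions rather than a description of any particular hardware: a single buffer of depth `K`, only the FIFO head is compared on a cache miss, prefetches complete instantly, and every name here (`toy_system`, `sb_restart`, `access_block`) is hypothetical.

```c
#include <stdbool.h>

#define K 4               /* stream buffer depth (assumed) */
#define CACHE_BLOCKS 256  /* toy cache: one presence flag per block */

struct stream_buffer {
    int  block[K];  /* circular FIFO of prefetched block numbers */
    int  head;      /* index of the oldest entry */
    int  next;      /* next block number to prefetch */
    bool valid;
};

struct toy_system {
    bool cache[CACHE_BLOCKS];
    struct stream_buffer sb;
    int memory_stalls;   /* accesses that had to wait for main memory */
};

/* Refill the buffer with the K blocks that follow a missing block. */
static void sb_restart(struct stream_buffer *sb, int miss_block)
{
    for (int j = 0; j < K; j++)
        sb->block[j] = miss_block + 1 + j;
    sb->head = 0;
    sb->next = miss_block + 1 + K;
    sb->valid = true;
}

/* Access block b (0 <= b < CACHE_BLOCKS). */
void access_block(struct toy_system *s, int b)
{
    struct stream_buffer *sb = &s->sb;

    if (s->cache[b])
        return;                               /* cache hit */

    if (sb->valid && sb->block[sb->head] == b) {
        s->cache[b] = true;                   /* move the entry to the cache */
        sb->block[sb->head] = sb->next++;     /* prefetch a new block into the freed slot */
        sb->head = (sb->head + 1) % K;
        return;
    }

    /* Miss in both cache and buffer: fetch from memory, restart the stream. */
    s->memory_stalls++;
    s->cache[b] = true;
    sb_restart(sb, b);
}
```

Starting from a zero-initialized `struct toy_system`, a sequential sweep stalls on memory only for the first block. Blocks that were prefetched but never referenced stay in the buffer and never enter the cache, which is how pollution is avoided.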

#### Prefetching with Arbitrary Strides

- Employ special logic to monitor the processor's address referencing pattern
- Detect constant stride array references originating from looping structures
- Compare successive addresses used by load or store instructions
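
Comparing successive addresses can be sketched for a single address stream as follows. The detector below is a simplified illustration, not the hardware logic itself: it treats a stride as confirmed once two consecutive address differences agree, and the names `stride_detector` and `observe` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

struct stride_detector {
    uint64_t prev_addr;  /* last address observed */
    int64_t  stride;     /* last observed address difference */
    int      seen;       /* addresses observed so far, saturating at 2 */
};

/* Feed one address. Returns true and sets *predicted when the last two
 * differences agree, i.e. a constant non-zero stride has been confirmed. */
bool observe(struct stride_detector *d, uint64_t addr, uint64_t *predicted)
{
    bool confirmed = false;

    if (d->seen >= 1) {
        int64_t delta = (int64_t)(addr - d->prev_addr);
        if (d->seen >= 2 && delta == d->stride && delta != 0) {
            *predicted = addr + (uint64_t)delta;  /* candidate prefetch address */
            confirmed = true;
        }
        d->stride = delta;
    }
    d->prev_addr = addr;
    if (d->seen < 2)
        d->seen++;
    return confirmed;
}
```

For a zero-initialized detector fed the addresses 1000, 1032, 1064, the third call confirms a stride of 32 and predicts 1096. A single detector like this is confused by interleaved streams, which motivates tracking the pattern per memory instruction.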






#### Reference Prediction Table (RPT)

- Hold information for the most recently used memory instructions to predict their access pattern
  - Address of the memory instruction
  - Previous address accessed by the instruction
  - Stride value
  - State field
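
The four fields above map naturally onto a small table indexed by instruction address. The sketch below is a simplified illustration: the direct-mapped organization, the table size, the three state names, and the rule of prefetching only in the steady state are all assumptions made here, and published RPT designs use a more elaborate state machine.

```c
#include <stdbool.h>
#include <stdint.h>

#define RPT_ENTRIES 64   /* hypothetical table size (direct-mapped) */

/* Simplified state field; an assumption of this sketch. */
enum rpt_state { RPT_INITIAL, RPT_TRANSIENT, RPT_STEADY };

struct rpt_entry {
    uint64_t tag;         /* address of the memory instruction */
    uint64_t prev_addr;   /* previous address accessed by the instruction */
    int64_t  stride;      /* stride value */
    enum rpt_state state; /* state field */
    bool     valid;
};

static struct rpt_entry rpt[RPT_ENTRIES];

/* Record one execution of the load/store at `pc` that touches `addr`.
 * Returns true and sets *prefetch_addr when a prefetch should be issued. */
bool rpt_update(uint64_t pc, uint64_t addr, uint64_t *prefetch_addr)
{
    struct rpt_entry *e = &rpt[pc % RPT_ENTRIES];

    if (!e->valid || e->tag != pc) {      /* allocate or replace the entry */
        e->tag = pc;
        e->prev_addr = addr;
        e->stride = 0;
        e->state = RPT_INITIAL;
        e->valid = true;
        return false;
    }

    int64_t delta = (int64_t)(addr - e->prev_addr);
    if (delta == e->stride) {
        e->state = RPT_STEADY;            /* stride prediction confirmed */
    } else {
        e->stride = delta;                /* learn the new stride */
        e->state = RPT_TRANSIENT;
    }
    e->prev_addr = addr;

    if (e->state == RPT_STEADY && e->stride != 0) {
        *prefetch_addr = addr + (uint64_t)e->stride;
        return true;
    }
    return false;
}
```

For an instruction at a hypothetical address `0x400` that touches 1000, 1008, 1016 in successive iterations, the first call allocates the entry, the second records a stride of 8, and the third confirms it and requests a prefetch of address 1024.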





