### Lecture 17: Reducing Cache Miss Penalty and Reducing Cache Hit Time

Hardware prefetching and stream buffers, software prefetching, virtually indexed caches

# Reducing Misses by <u>Hardware</u> Prefetching of Instructions & Data

- E.g., instruction prefetching
  - Alpha 21064 fetches 2 blocks on a miss
  - Extra block placed in a "stream buffer"
  - On a miss, check the stream buffer
- Works with data blocks too:
  - Jouppi [1990]: 1 data stream buffer got 25% of misses from a 4KB cache; 4 streams got 43%
  - Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches
- Prefetching relies on having extra memory bandwidth that can be used without penalty
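
The stream buffer behavior described above can be sketched as a small simulation. This is a minimal model under stated assumptions, not the Alpha 21064 or Jouppi design: the function name `simulate_stream_buffer`, the direct-mapped cache, the buffer depth of 4, and the head-only comparison are illustrative choices, and timing (whether a prefetched block has arrived yet) is ignored.

```python
from collections import deque


def simulate_stream_buffer(block_addrs, num_sets=64, buffer_depth=4):
    """Classify each access as a cache hit, a stream buffer hit, or a memory fetch.

    Hypothetical model: direct-mapped cache indexed by block address, one
    FIFO stream buffer whose head entry is the only one compared on a miss.
    """
    cache = {}        # set index -> block address currently stored there
    stream = deque()  # prefetched block addresses, oldest (head) first
    stats = {"cache_hit": 0, "stream_hit": 0, "memory": 0}

    for block in block_addrs:
        index = block % num_sets
        if cache.get(index) == block:
            stats["cache_hit"] += 1
            continue

        if stream and stream[0] == block:
            # Cache miss served by the head of the stream buffer.
            stats["stream_hit"] += 1
            stream.popleft()
            # Keep the buffer full by prefetching the next sequential block.
            last = stream[-1] if stream else block
            stream.append(last + 1)
        else:
            # Miss everywhere: fetch from memory and restart the stream
            # with the blocks that follow the missing one.
            stats["memory"] += 1
            stream.clear()
            stream.extend(range(block + 1, block + 1 + buffer_depth))

        cache[index] = block

    return stats
```

On a purely sequential trace such as `simulate_stream_buffer(range(100))`, only the first access goes to memory and the remaining 99 misses are caught by the buffer. A stride-2 trace defeats this simple buffer entirely, which is one reason multiple stream buffers help on real programs.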





# Reducing Misses by <u>Software</u> Prefetching Data

### Data Prefetch

- Register prefetch: load data into a register (HP PA-RISC loads)
- Cache prefetch: load data into the cache (MIPS IV, PowerPC, SPARC V9)
- Special prefetching instructions cannot cause faults; they are a form of speculative execution
- Prefetching comes in two flavors:
  - Binding prefetch: the request loads directly into a register.
    - Must be the correct address and register!
  - Non-binding prefetch: the request loads into the cache.
    - Can be incorrect. Frees HW/SW to guess!
- Issuing prefetch instructions takes time
  - Is the cost of issuing prefetches < the savings from reduced misses?
  - Wider superscalar issue makes the extra issue bandwidth less of a problem
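
The cost question above can be made concrete with a back-of-the-envelope model. This is only a sketch: the function name `prefetch_net_cycles`, its parameters, and all the numbers in the usage note are hypothetical, and the model assumes every removed miss saves the full miss penalty.

```python
def prefetch_net_cycles(prefetches_issued, misses_removed, miss_penalty, issue_cost=1):
    """Return cycles saved minus cycles spent issuing prefetch instructions.

    A positive result means prefetching pays off under this simple model.
    All parameters are illustrative assumptions, in units of cycles or counts.
    """
    overhead = prefetches_issued * issue_cost  # cost of the extra instructions
    savings = misses_removed * miss_penalty    # stall cycles no longer paid
    return savings - overhead
```

For example, a 1000-iteration loop that issues one prefetch per iteration and removes 125 misses gains `prefetch_net_cycles(1000, 125, 50)` = 5250 cycles with a 50-cycle miss penalty, but loses 375 cycles if the penalty is only 5 cycles. Issuing one prefetch per cache block instead of one per iteration would shrink the overhead term.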

# Improving Cache Performance

1. Reducing miss rates
   - Larger block size
   - Larger cache size
   - Higher associativity
   - Victim caches
   - Way prediction and pseudoassociativity
   - Compiler optimization
2. Reducing miss penalty
   - Multilevel caches
   - Critical word first
   - Read miss first
   - Merging write buffers
3. Reducing miss penalty or miss rates via parallelism
   - Non-blocking caches
   - Hardware prefetching
   - Compiler prefetching
4. Reducing cache hit time
   - Small and simple caches
   - Avoiding address translation
   - Pipelined cache access
   - Trace caches
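
The three quantities targeted by these categories (hit time, miss rate, miss penalty) combine in the standard average memory access time formula. The slide does not state it, so the sketch below is an added illustration; the function name `amat` and the sample numbers are assumptions.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty.

    hit_time and miss_penalty are in cycles; miss_rate is a fraction in [0, 1].
    """
    return hit_time + miss_rate * miss_penalty


# Hypothetical baseline: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
baseline = amat(1, 0.05, 100)
# Halving the miss rate and halving the miss penalty give the same benefit here.
lower_miss_rate = amat(1, 0.025, 100)
lower_miss_penalty = amat(1, 0.05, 50)
```

Each group of techniques in the list attacks one term of this sum, which is why a technique that helps one term but hurts another has to be judged on the total.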






### Fast Cache Hits by Avoiding Translation: Process ID impact

- Black: uniprocess
- Light gray: multiprocess, flushing the cache on a context switch
- Dark gray: multiprocess, using process ID tags
- Y axis: miss rate, up to 20%
- X axis: cache size, from 2 KB to 1024 KB
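
The difference between the flush and process ID curves can be reproduced qualitatively with a toy model of a virtually addressed, direct-mapped cache. This is a sketch under assumptions: the function names, the cache size, and the synthetic trace (two processes touching disjoint virtual blocks in alternating time slices) are all hypothetical and are not the workload behind the figure.

```python
def count_misses(trace, num_sets=256, policy="pid_tag"):
    """Count misses in a direct-mapped, virtually addressed cache.

    trace  -- list of (pid, virtual_block) accesses
    policy -- "flush": empty the cache whenever the running process changes
              "pid_tag": extend each tag with the process ID, never flush
    """
    if policy not in ("flush", "pid_tag"):
        raise ValueError("policy must be 'flush' or 'pid_tag'")

    cache = {}          # set index -> tag stored in that set
    misses = 0
    current_pid = None

    for pid, vblock in trace:
        if policy == "flush" and pid != current_pid:
            cache.clear()  # context switch: all cached virtual tags are stale
        current_pid = pid

        index = vblock % num_sets
        tag = (pid, vblock) if policy == "pid_tag" else vblock
        if cache.get(index) != tag:
            misses += 1
            cache[index] = tag

    return misses


def make_trace(rounds=4, blocks_per_proc=64):
    """Two processes alternate time slices, each sweeping its own blocks."""
    trace = []
    for _ in range(rounds):
        for pid in (0, 1):
            base = pid * blocks_per_proc
            trace += [(pid, base + b) for b in range(blocks_per_proc)]
    return trace
```

With the default trace, `count_misses(make_trace(), policy="pid_tag")` sees only the 128 cold misses, while `policy="flush"` reloads every block after each context switch and takes 512 misses. If the two processes used the same virtual blocks, the process ID tags would instead turn the sharing into conflict misses in this direct-mapped model.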





## **Pipelined Cache Access**

For multi-issue, cache bandwidth affects *effective* cache hit time

 Queueing delay adds up if cache does not have enough read/write ports

Pipelined cache accesses: reduce cache cycle time and improve bandwidth

Cache organization for high bandwidth

- Duplicate cache
- Banked cache
- Double clocked cache
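
As an illustration of the banked organization, the sketch below estimates how many cycles a group of simultaneous accesses needs when each bank has a single port. The function name `cycles_for_group`, the block-interleaved bank mapping, and the default of 4 banks with 64-byte blocks are assumptions chosen for the example.

```python
def cycles_for_group(addresses, num_banks=4, block_bytes=64):
    """Cycles needed to serve accesses issued together to a banked cache.

    Assumes consecutive blocks map to consecutive banks and that each bank
    can serve one access per cycle, so the busiest bank sets the total.
    """
    per_bank = [0] * num_banks
    for addr in addresses:
        bank = (addr // block_bytes) % num_banks
        per_bank[bank] += 1  # accesses to the same bank must queue
    return max(per_bank)
```

Four accesses to consecutive blocks, `cycles_for_group([0, 64, 128, 192])`, spread across all four banks and finish in 1 cycle. Four accesses that all fall in bank 0, `cycles_for_group([0, 256, 512, 768])`, serialize and take 4 cycles, which is the queueing delay the slide refers to.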


# **Pipelined Cache Access**

### Alpha 21264 Data cache design

- The cache is 64KB, 2-way set associative; it cannot be accessed within one cycle
- One cycle is used for address transfer and data transfer, pipelined with the data array access
- The cache clock runs at twice the processor frequency; the array is wave pipelined to achieve that speed





# Cache Optimization Summary

MP = miss penalty, MR = miss rate, HT = hit time. A + means the technique improves that factor; a - means it hurts it.

| Category     | Technique            | MP | MR | HT | Complexity |
|--------------|----------------------|----|----|----|------------|
| miss penalty | Multilevel cache     | +  |    |    | 2          |
|              | Critical word first  | +  |    |    | 2          |
|              | Read first           | +  |    |    | 1          |
|              | Merging write buffer | +  |    |    | 1          |
|              | Victim caches        | +  | +  |    | 2          |
| miss rate    | Larger block         | -  | +  |    | 0          |
|              | Larger cache         |    | +  | -  | 1          |
|              | Higher associativity |    | +  | -  | 1          |
|              | Way prediction       |    | +  |    | 2          |
|              | Pseudoassociative    |    | +  |    | 2          |
|              | Compiler techniques  |    | +  |    | 0          |

| Category     | Technique                    | MP | MR | HT | Complexity |
|--------------|------------------------------|----|----|----|------------|
| miss penalty | Nonblocking caches           | +  |    |    | 3          |
|              | Hardware prefetching         | +  |    |    | 2/3        |
|              | Software prefetching         | +  | +  |    | 3          |
| hit time     | Small and simple cache       |    | -  | +  | 0          |
|              | Avoiding address translation |    |    | +  | 2          |
|              | Pipeline cache access        |    |    | +  | 1          |
|              | Trace cache                  |    |    | +  | 3          |