Final Examination Guidelines:
- It will be an open-book examination.
- Do not cut your preparation time short because it is open-book. Treat it as if it were a closed-book examination.
- How to prepare (suggested procedure):
  1. Read through the rest of this guideline.
  2. Review the lecture slides and your textbook.
  3. Read through the guideline again and note down the topics you are not confident about.
  4. Review the lecture slides and/or your textbook for a better understanding of those topics.
  5. If you still have questions, ask the instructor in the Wednesday and Friday classes, during his office hours, and/or by e-mail.
  6. Repeat steps 3, 4, and 5 until you are confident about the topics (or at least most of them).
Coverage: Chapters 6, 7, and 8, plus some knowledge from other chapters
MIPS PIPELINE
Pipelining Overview
- Instruction throughput vs. latency (pipelining improves instruction throughput)
- Theoretical speedup from using an n-stage pipeline
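  - Ideal case: with n balanced stages and no hazards, the time between completed instructions drops from T to T/n, so throughput improves by up to a factor of n; pipeline fill/drain and hazards make the real speedup smaller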
- The concept of pipeline hazards: situations (e.g. specific instruction sequences) in which simple pipelining does not work correctly
- Types of pipeline hazards: structural, control, and data
- Removing the structural hazard: split instruction and data memories
- Data hazards, forwarding, and stalls: concepts, limits, and performance effects
  - Forwarding passes data from a later pipeline stage to an earlier pipeline stage
  - Equivalently, forwarding passes data from one instruction to the instructions that need the data before it is written into the register file
  - Forwarding has a fundamental limit: it is impossible when the data is not yet available
  - Forwarding can eliminate all potential pipeline stalls when an R-type instruction is the producer
  - Forwarding cannot eliminate all potential pipeline stalls when a LW instruction is the producer, but it can reduce the stall to one cycle
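    - Ex: in “LW $2, 0($1); ADD $4, $2, $3”, the loaded value is available only after MEM, so the ADD stalls one cycle even with forwarding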
  - The register file is written in the first half of the clock cycle and read in the second half, reducing the complexity of forwarding
  - Note: data forwarding deals with data passing through the register file; data passing through memory (e.g. “SW $9, 0($8); LW $10, 0($8)”) does not stall the MIPS pipeline
- Control hazards: concept, solutions, and performance effects
  - Instruction fetch depends on the previous branch decision
  - Several possible solutions:
    - Stall the pipeline on branches (not really used)
    - Always fetch instructions sequentially (predict not taken)
    - Dynamically predict the branch direction
    - Use delayed branches to reduce the performance penalty
  - Calculating control stalls: #mispredictions * misprediction penalty
    - If predicting always not taken, the penalty is “#(taken branches) * misprediction penalty”
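    - Ex (hypothetical numbers): if 20% of instructions are branches, 60% of branches are taken, and the misprediction penalty is 1 cycle, predict-not-taken adds 100 * 20% * 60% * 1 = 12 stall cycles per 100 instructions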
Pipelined datapath
- 5-stage pipeline: IF ID EX MEM WB
- Execution of independent instructions on the 5-stage MIPS pipeline
- The use of pipeline registers
- Details of the data flow through the pipeline
Pipelined control
- Passing control values through the pipeline stages along with the data
- Details of passing control values through the pipeline
Data Hazards and forwarding
- Data dependence: how to find the data dependences among given instructions
- Hazard detection equations for the EX hazard and the MEM hazard (a sketch follows this list)
- Multiplexors and controls for data forwarding
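A minimal sketch of the textbook EX- and MEM-hazard forwarding conditions, assuming simplified dictionary fields (reg_write, rd, rs) standing in for the EX/MEM and MEM/WB pipeline-register signals:

    # Returns the ForwardA mux control for the first ALU operand.
    def forward_a(id_ex_rs, ex_mem, mem_wb):
        # EX hazard: the instruction now in MEM writes the register that
        # the instruction in EX reads -> forward the ALU result.
        if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == id_ex_rs:
            return 0b10
        # MEM hazard: the instruction now in WB writes the needed register
        # and the EX hazard does not already cover it -> forward from MEM/WB.
        if (mem_wb["reg_write"] and mem_wb["rd"] != 0
                and not (ex_mem["reg_write"] and ex_mem["rd"] != 0
                         and ex_mem["rd"] == id_ex_rs)
                and mem_wb["rd"] == id_ex_rs):
            return 0b01
        return 0b00  # no forwarding: operand comes from the register file

The condition for ForwardB is identical with rt in place of rs.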
Data Hazards and Stalls
- Hazard detection equation (a sketch appears at the end of this subsection)
- How to stall the pipeline for a data hazard:
  - Hold the instructions at IF and ID
  - Insert a NOOP at EX
  - Let the instructions at EX and MEM go on to their next stages
- Changes made to the pipelined control and datapath:
  - Add a PCWrite control to the PC to stall the instruction at IF
  - Add an IF/IDWrite control to the IF/ID register to stall the instruction at ID
  - Add a new mux that clears the control values in the ID/EX register so as to insert a NOOP at EX
- Notes on implementation details:
  - To hold the instruction at IF is to keep the PC unchanged
  - To hold the instruction at ID is to keep the IF/ID register unchanged
  - To insert a NOOP at EX is to set the controls in the ID/EX register to zeros while holding the instruction at ID
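A minimal sketch of the textbook load-use hazard detection condition, again with simplified field names:

    def stall_needed(id_ex, if_id):
        # Stall when the instruction in EX is a load (MemRead) whose
        # destination register is read by the instruction in ID.
        return (id_ex["mem_read"]
                and id_ex["rt"] in (if_id["rs"], if_id["rt"]))

    # On a stall: deassert PCWrite and IF/IDWrite (holding IF and ID) and
    # zero the control fields in ID/EX (inserting a NOOP/bubble at EX).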
Branch Hazards
- Early branch decision at the ID stage: moving the branch adder and the branch test to ID
- How to flush:
  - Flush the instruction at IF
  - Let the instructions at ID (the branch), EX, and MEM move on to their next stages
- Changes to the pipelined control and datapath:
  - Add an IF.Flush control to the IF/ID register
  - When IF.Flush is asserted, set the IF/ID register to zeros so as to flush the instruction at IF
- Notes on flushing:
  - To flush the instruction at IF is to set the IF/ID register to zeros rather than holding the instruction at IF
  - The instruction is effectively converted into a NOOP
- Additional data forwarding is needed for the early branch decision at ID
Exceptions
- Exceptions vs. interrupts in the MIPS implementation
- Sources of exceptions
- The use of the EPC and Cause registers
- The implementation of exception handling for ALU overflow:
  - Change the PC mux (adding the 40000040 input) so as to transfer execution to the OS exception handler
  - The instructions at IF, ID, and EX should be flushed
- Notes on the exception implementation:
  - Flushing the instruction at IF is combined with the flushing done for branch instructions
  - Flushing the instruction at ID is combined with the insertion of a NOOP at EX used for data stalls
  - Flushing the instruction at EX is done by setting the controls in the EX/MEM register to zeros; the EX.Flush control and a couple of muxes are added for this purpose
- An exception can be thought of as a branch/jump, but it can happen at any stage
- Multiple exceptions should be handled in the program order of the instructions causing them
Superscalar and dynamic scheduling will NOT be covered in the final.
MEMORY HIERARCHY
Overview
- The principle of locality
- Two types of locality: temporal and spatial
- Memory limitation: memory is either small and fast (and expensive), or large and slow (and inexpensive)
- Memory hierarchy: multi-level memories, with the smaller and faster memories at the top
- Memory technologies: SRAM, DRAM, and disk
- Some concepts:
  - Block: the unit of data moved between levels and cached at a higher level
  - Hit: the referenced data is found in the upper level (e.g. a cache hit)
  - Miss: the referenced data is not found in the upper level (e.g. a cache miss); the data must be fetched from the lower level
  - Hit time and miss penalty
The basics of caches
- Direct-mapped cache organization
  - Tag and data block
  - Address format for cache mapping: tag, index, and block offset (for a given reference address, how to determine which cache block to check and how to determine the tag value; how to determine the number of bits in each field)
    - Ex: Assume an 8KB direct-mapped cache with 32-byte blocks and 32-bit memory addresses. (1) The block size is 32 = 2^5 bytes, thus #offset-bits = 5. (2) The cache size is 8KB = 2^13 bytes, thus #blocks = 2^13/2^5 = 2^8, and #index-bits = 8. (3) #tag-bits + #index-bits + #offset-bits = 32, thus #tag-bits = 19.
    - Some calculation tips: 1K = 2^10, 1M = 2^20, 1G = 2^30; 2^3 = 8, 2^4 = 16, ...
  - Tag comparison to determine whether a reference is a hit or a miss
  - Calculation of the cache overhead (the ratio of tag storage to data storage); a sketch of both calculations follows
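A minimal sketch of the field-width and overhead arithmetic for the 8KB example above (counting a one-bit valid field in the overhead is an assumption, though a common convention):

    import math

    def cache_fields(cache_bytes, block_bytes, addr_bits=32):
        # Compute the (tag, index, offset) bit widths for a direct-mapped cache.
        offset = int(math.log2(block_bytes))                 # 32 B -> 5 bits
        index = int(math.log2(cache_bytes // block_bytes))   # 2^8 blocks -> 8 bits
        tag = addr_bits - index - offset                     # 32 - 8 - 5 = 19 bits
        return tag, index, offset

    tag, index, offset = cache_fields(8 * 1024, 32)
    blocks = 2 ** index
    # Overhead: (tag + valid) bits per block relative to the data bits per block.
    overhead = blocks * (tag + 1) / (8 * 1024 * 8)
    print(tag, index, offset, overhead)   # 19 8 5 0.078125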
- Using a larger block size to exploit spatial locality (e.g. if the block size is doubled, how many misses may be eliminated for a given access pattern)
- DRAM memory design to support the cache
  - DRAM memory property: the cycle time is longer than the access time, i.e., the DRAM must sit idle for a certain time after an access
  - Use multiple memory banks to support pipelined data transfer of a large block
Measuring and improving cache performance
- Simple cache performance evaluation
  - Memory stall cycles = misses per instruction * miss penalty * #instructions
  - Memory stall cycles = misses per reference * miss penalty * #references
    - #references = #instructions * (percentage of memory-reference instructions)
  - The CPI increase caused by cache misses is usually asked for
  - For convenience, 100 instructions can be used as the base of the calculation
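    - Ex (hypothetical numbers): with 30% memory-reference instructions, a 5% miss rate per reference, and a 100-cycle miss penalty, the stall cycles per 100 instructions are 100 * 30% * 5% * 100 = 150, i.e., the CPI increases by 1.5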
- The use of fully associative and set-associative caches
  - Address mapping format for a set-associative cache: tag, index, and offset
  - Tag comparison for a set-associative or fully associative cache
- LRU replacement policy
  - Implementing LRU using counters (see the sketch below)
  - Determining the misses for an access sequence under the LRU policy
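A minimal sketch of LRU-by-counters for a single cache set (the associativity and the access sequence are made up for illustration):

    def lru_misses(ways, accesses):
        # Count the misses for one set, with LRU kept as per-block counters
        # (0 = most recently used; the largest counter marks the LRU victim).
        blocks = {}   # tag -> counter
        misses = 0
        for tag in accesses:
            if tag in blocks:
                last = blocks[tag]
                for t in blocks:          # age blocks more recent than this one
                    if blocks[t] < last:
                        blocks[t] += 1
            else:
                misses += 1
                if len(blocks) == ways:   # set full: evict the LRU block
                    del blocks[max(blocks, key=blocks.get)]
                for t in blocks:          # every resident block gets older
                    blocks[t] += 1
            blocks[tag] = 0               # the touched block is now the MRU
        return misses

    print(lru_misses(2, [0, 8, 0, 6, 8]))   # 4 misses in a 2-way set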
- Multilevel caches are used to reduce the miss penalty (discussed below as a two-level cache)
  - The L1 cache can be optimized for hit time and the L2 cache for reducing the miss rate
  - Global hit/miss rates vs. local hit/miss rates
  - Cache performance evaluation for a multilevel cache: memory stall cycles = stall cycles for L1 misses + stall cycles for L2 misses
  - The L2 miss penalty includes the hit time of the L2 cache
  - Ex: 5% L1 misses and 2% L2 misses per instruction (global miss rates), 20-cycle L2 access time, 100-cycle DRAM access time; two ways of calculating the stall cycles for 100 instructions:
    - 100 * (5% - 2%) * 20 + 100 * 2% * (100 + 20) = 60 + 240 = 300
    - 100 * 5% * 20 + 100 * 2% * 100 = 100 + 200 = 300
    - Note: the L2 cache must be checked before finding out that a DRAM access is necessary, thus for the misses that go to DRAM (L2 misses) the penalty is 120 instead of 100
    - Note: the L1 cache access time is usually counted in the ideal execution time, thus the L1 miss penalty is 20 cycles instead of 21
Virtual Memory
- Why virtual memory: a larger memory space for programs, protection, relocation, and the sharing of physical memory
- Page table organization: physical page number, virtual page number, valid bit
- Address format for a VM access: virtual page number + page offset
- Calculation of the page table size
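  - Ex (entry size assumed for illustration): with 32-bit virtual addresses, 4KB pages, and 4-byte page table entries, the page table has 2^32/2^12 = 2^20 entries and occupies 2^20 * 4B = 4MB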
- Page faults and their handling
- TLB: a cache for the page table
  - The importance of the TLB: without a TLB, every memory reference must go through the page table
  - What is the tag in a TLB? How to calculate the number of bits in the TLB tag?
  - Fully associative TLB vs. set-associative TLB
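    - Ex (page size chosen for illustration): with 32-bit virtual addresses and 4KB pages, the virtual page number is 20 bits; in a fully associative TLB the tag is the whole 20-bit virtual page number, while in a set-associative TLB with 2^k sets the tag is 20 - k bits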
- Putting VM and caches together
  - The procedure for going through the TLB, the page table, the caches, and DRAM
  - Calculating memory stall cycles in the presence of L1 misses, L2 misses, TLB misses, and page faults
Common Framework for memory hierarchy
- Four key issues: mapping memory blocks onto the cache, determining cache hits/misses, cache replacement, and write policy
- Cache miss classification: compulsory, capacity, and conflict
- Write policies
  - Write-through vs. write-back (on a cache write hit, whether or not the write goes through to DRAM)
  - Write-allocate vs. write-around (on a cache write miss, whether or not the DRAM block is loaded into the cache)
- Write policy complicates cache performance evaluation
  - We had assumed that writes behave the same as reads when calculating memory stall cycles, which is not correct
  - For a write-back cache, a read miss may cause the replacement of a dirty block; the miss penalty may then double because of the additional DRAM access needed to write the dirty block back to memory (depending on the specific design)
  - For a write-through cache, the write buffer may be full when a write hit happens, so even a write hit can cause a pipeline stall
  - For a write-allocate cache, a write miss may cause the same penalty as a read miss
  - For a write-around cache, a write miss may not cause a pipeline stall
- Study the “Big Example”, week 13, slide 10
I/O SYSTEMS
- Types and characteristics of I/O devices
- Interfacing I/O devices to the processor and memory
  - Polling, interrupt-driven I/O, and DMA
  - Calculating the processor overhead ratio: #IO-events-per-second * cost-per-event / #instructions-per-second
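    - Ex (hypothetical numbers): polling a device 1000 times per second at a cost of 400 instructions per poll, on a processor executing 10^8 instructions per second, gives an overhead of 1000 * 400 / 10^8 = 0.4%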
- Buses
  - Memory bus, backplane bus, and I/O bus
  - Bus arbitration
  - Determining bus transaction latency and effective data transfer rate
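    - Ex (hypothetical numbers): if a bus transaction takes 1 cycle for arbitration, 1 cycle for the address, and 4 cycles to transfer 16 bytes on a 100MHz bus, the effective transfer rate is 16B / (6 * 10ns) ≈ 267MB/s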
- Disk drive
  - Disk drive internal structure
  - Determining disk access latency and maximal I/O rate
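    - Ex (hypothetical numbers): for a 10,000 RPM disk with a 6ms average seek time and a 50MB/s transfer rate, reading a 4KB block takes about 6ms + 3ms (half a rotation) + 0.08ms ≈ 9.1ms, so the maximal rate is roughly 110 such I/Os per second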