Final Examination Guidelines:
- It will be an open-book examination.
- Do not cut your preparation time short because it is open-book. Treat it as if it were a closed-book examination.
- How to prepare (suggested procedure):
  1. Read through the rest of this guideline.
  2. Review the lecture slides and your textbook.
  3. Read through the guideline again and note down the topics you are not confident about.
  4. Review the lecture slides and/or your textbook for a better understanding of those topics.
  5. If you still have questions, ask the instructor in the Wednesday and Friday classes, during his office hours, and/or by e-mail.
  6. Repeat steps 3, 4, and 5 until you are confident about the topics (or at least most of them).
Coverage: Chapters 6, 7, and 8, plus some knowledge from other chapters
MIPS PIPELINE
Pipelining Overview
- Instruction throughput vs. latency (pipelining improves instruction throughput)
- Theoretical speedup from using an n-stage pipeline
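  - Ideal case: with n balanced stages and no hazards, the time between completed instructions drops from T to T/n, so throughput improves by up to a factor of n; pipeline fill/drain and hazards make the real speedup smaller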
- The concept of pipeline hazards: situations (e.g. specific instruction sequences) in which simple pipelining does not work correctly
- Types of pipeline hazards: structural, control, and data
- Removing the structural hazard: split instruction and data memories
- Data hazards, forwarding, and stalls: concepts, limits, and performance effects
  - Forwarding passes data from a later pipeline stage to an earlier pipeline stage
  - Equivalently, forwarding passes data from one instruction to the instructions that need the data before it is written into the register file
  - Forwarding has a fundamental limit: it is impossible when the data is not yet available
  - Forwarding can eliminate all potential pipeline stalls when an R-type instruction is the producer
  - Forwarding cannot eliminate all potential pipeline stalls when a LW instruction is the producer, but it can reduce the stall to one cycle
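    - Ex: in “LW $2, 0($1); ADD $4, $2, $3”, the loaded value is available only after MEM, so the ADD stalls one cycle even with forwarding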
  - The register file is written in the first half of the clock cycle and read in the second half, reducing the complexity of forwarding
  - Note: data forwarding deals with data passing through the register file; data passing through memory (e.g. “SW $9, 0($8); LW $10, 0($8)”) does not stall the MIPS pipeline
- Control hazards: concept, solutions, and performance effects
  - Instruction fetch depends on the previous branch decision
  - Several possible solutions:
    - Stall the pipeline on branches (not really used)
    - Always fetch instructions sequentially (predict not taken)
    - Dynamically predict the branch direction
    - Use delayed branches to reduce the performance penalty
  - Calculating control stalls: #mispredictions * misprediction penalty
    - If predicting always not taken, the penalty is “#(taken branches) * misprediction penalty”
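    - Ex (hypothetical numbers): if 20% of instructions are branches, 60% of branches are taken, and the misprediction penalty is 1 cycle, predict-not-taken adds 100 * 20% * 60% * 1 = 12 stall cycles per 100 instructions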
Pipelined datapath
- 5-stage pipeline: IF ID EX MEM WB
- Execution of independent instructions on the 5-stage MIPS pipeline
- The use of pipeline registers
- Details of the data flow through the pipeline
Pipelined control
- Passing control values through the pipeline stages along with the data
- Details of passing control values through the pipeline
Data Hazards and forwarding
- Data dependence: how to find the data dependences among given instructions
- Hazard detection equations for the EX hazard and the MEM hazard (a sketch follows this list)
- Multiplexors and controls for data forwarding
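A minimal sketch of the textbook EX- and MEM-hazard forwarding conditions, assuming simplified dictionary fields (reg_write, rd, rs) standing in for the EX/MEM and MEM/WB pipeline-register signals:

    # Returns the ForwardA mux control for the first ALU operand.
    def forward_a(id_ex_rs, ex_mem, mem_wb):
        # EX hazard: the instruction now in MEM writes the register that
        # the instruction in EX reads -> forward the ALU result.
        if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == id_ex_rs:
            return 0b10
        # MEM hazard: the instruction now in WB writes the needed register
        # and the EX hazard does not already cover it -> forward from MEM/WB.
        if (mem_wb["reg_write"] and mem_wb["rd"] != 0
                and not (ex_mem["reg_write"] and ex_mem["rd"] != 0
                         and ex_mem["rd"] == id_ex_rs)
                and mem_wb["rd"] == id_ex_rs):
            return 0b01
        return 0b00  # no forwarding: operand comes from the register file

The condition for ForwardB is identical with rt in place of rs.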
Data Hazards and Stalls
- Hazard detection equation (a sketch appears at the end of this subsection)
- How to stall the pipeline for a data hazard:
  - Hold the instructions at IF and ID
  - Insert a NOOP at EX
  - Let the instructions at EX and MEM go on to their next stages
- Changes made to the pipelined control and datapath:
  - Add a PCWrite control to the PC to stall the instruction at IF
  - Add an IF/IDWrite control to the IF/ID register to stall the instruction at ID
  - Add a new mux that clears the control values in the ID/EX register so as to insert a NOOP at EX
- Notes on implementation details:
  - To hold the instruction at IF is to keep the PC unchanged
  - To hold the instruction at ID is to keep the IF/ID register unchanged
  - To insert a NOOP at EX is to set the controls in the ID/EX register to zeros while holding the instruction at ID
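A minimal sketch of the textbook load-use hazard detection condition, again with simplified field names:

    def stall_needed(id_ex, if_id):
        # Stall when the instruction in EX is a load (MemRead) whose
        # destination register is read by the instruction in ID.
        return (id_ex["mem_read"]
                and id_ex["rt"] in (if_id["rs"], if_id["rt"]))

    # On a stall: deassert PCWrite and IF/IDWrite (holding IF and ID) and
    # zero the control fields in ID/EX (inserting a NOOP/bubble at EX).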
Branch Hazards
- Early branch decision at the ID stage: moving the branch adder and the branch test to ID
- How to flush:
  - Flush the instruction at IF
  - Let the instructions at ID (the branch), EX, and MEM move on to their next stages
- Changes to the pipelined control and datapath:
  - Add an IF.Flush control to the IF/ID register
  - When IF.Flush is asserted, set the IF/ID register to zeros so as to flush the instruction at IF
- Notes on flushing:
  - To flush the instruction at IF is to set the IF/ID register to zeros rather than holding the instruction at IF
  - The instruction is effectively converted into a NOOP
- Additional data forwarding is needed for the early branch decision at ID
Exceptions
- Exceptions vs. interrupts in the MIPS implementation
- Sources of exceptions
- The use of the EPC and Cause registers
- The implementation of exception handling for ALU overflow:
  - Change the PC mux (adding the 40000040 input) so as to transfer execution to the OS exception handler
  - The instructions at IF, ID, and EX should be flushed
- Notes on the exception implementation:
  - Flushing the instruction at IF is combined with the flushing done for branch instructions
  - Flushing the instruction at ID is combined with the insertion of a NOOP at EX used for data stalls
  - Flushing the instruction at EX is done by setting the controls in the EX/MEM register to zeros; the EX.Flush control and a couple of muxes are added for this purpose
- An exception can be thought of as a branch/jump, but it can happen at any stage
- Multiple exceptions should be handled in the program order of the instructions causing them
Superscalar and dynamic scheduling will NOT be covered in the final.
MEMORY HIERARCHY
Overview
- The principle of locality
- Two types of locality: temporal and spatial
- Memory limitation: memory is either small and fast (and expensive), or large and slow (and inexpensive)
- Memory hierarchy: multi-level memories, with the smaller and faster memories at the top
- Memory technologies: SRAM, DRAM, and disk
- Some concepts:
  - Block: the unit of data moved between levels and cached at a higher level
  - Hit: the referenced data is found in the upper level (e.g. a cache hit)
  - Miss: the referenced data is not found in the upper level (e.g. a cache miss); the data must be fetched from the lower level
  - Hit time and miss penalty
The basics of caches
- Direct-mapped cache organization
  - Tag and data block
  - Address format for cache mapping: tag, index, and block offset (for a given reference address, how to determine which cache block to check and how to determine the tag value; how to determine the number of bits in each field)
    - Ex: Assume an 8KB direct-mapped cache with 32-byte blocks and 32-bit memory addresses. (1) The block size is 32 = 2^5 bytes, thus #offset-bits = 5. (2) The cache size is 8KB = 2^13 bytes, thus #blocks = 2^13/2^5 = 2^8, and #index-bits = 8. (3) #tag-bits + #index-bits + #offset-bits = 32, thus #tag-bits = 19.
    - Some calculation tips: 1K = 2^10, 1M = 2^20, 1G = 2^30; 2^3 = 8, 2^4 = 16, ...
  - Tag comparison to determine whether a reference is a hit or a miss
  - Calculation of the cache overhead (the ratio of tag storage to data storage); a sketch of both calculations follows
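A minimal sketch of the field-width and overhead arithmetic for the 8KB example above (counting a one-bit valid field in the overhead is an assumption, though a common convention):

    import math

    def cache_fields(cache_bytes, block_bytes, addr_bits=32):
        # Compute the (tag, index, offset) bit widths for a direct-mapped cache.
        offset = int(math.log2(block_bytes))                 # 32 B -> 5 bits
        index = int(math.log2(cache_bytes // block_bytes))   # 2^8 blocks -> 8 bits
        tag = addr_bits - index - offset                     # 32 - 8 - 5 = 19 bits
        return tag, index, offset

    tag, index, offset = cache_fields(8 * 1024, 32)
    blocks = 2 ** index
    # Overhead: (tag + valid) bits per block relative to the data bits per block.
    overhead = blocks * (tag + 1) / (8 * 1024 * 8)
    print(tag, index, offset, overhead)   # 19 8 5 0.078125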
- Using a larger block size to exploit spatial locality (e.g. if the block size is doubled, how many misses may be eliminated for a given access pattern)
- DRAM memory design to support the cache
  - DRAM memory property: the cycle time is longer than the access time, i.e., the DRAM must sit idle for a certain time after an access
  - Use multiple memory banks to support pipelined data transfer of a large block
Measuring and improving cache performance
- Simple cache performance evaluation
  - Memory stall cycles = misses per instruction * miss penalty * #instructions
  - Memory stall cycles = misses per reference * miss penalty * #references
    - #references = #instructions * (percentage of memory-reference instructions)
  - The CPI increase caused by cache misses is usually asked for
  - For convenience, 100 instructions can be used as the base of the calculation
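    - Ex (hypothetical numbers): with 30% memory-reference instructions, a 5% miss rate per reference, and a 100-cycle miss penalty, the stall cycles per 100 instructions are 100 * 30% * 5% * 100 = 150, i.e., the CPI increases by 1.5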
- The use of fully associative and set-associative caches
  - Address mapping format for a set-associative cache: tag, index, and offset
  - Tag comparison for a set-associative or fully associative cache
- LRU replacement policy
  - Implementing LRU using counters (see the sketch below)
  - Determining the misses for an access sequence under the LRU policy
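A minimal sketch of LRU-by-counters for a single cache set (the associativity and the access sequence are made up for illustration):

    def lru_misses(ways, accesses):
        # Count the misses for one set, with LRU kept as per-block counters
        # (0 = most recently used; the largest counter marks the LRU victim).
        blocks = {}   # tag -> counter
        misses = 0
        for tag in accesses:
            if tag in blocks:
                last = blocks[tag]
                for t in blocks:          # age blocks more recent than this one
                    if blocks[t] < last:
                        blocks[t] += 1
            else:
                misses += 1
                if len(blocks) == ways:   # set full: evict the LRU block
                    del blocks[max(blocks, key=blocks.get)]
                for t in blocks:          # every resident block gets older
                    blocks[t] += 1
            blocks[tag] = 0               # the touched block is now the MRU
        return misses

    print(lru_misses(2, [0, 8, 0, 6, 8]))   # 4 misses in a 2-way set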
- Multilevel caches are used to reduce the miss penalty (discussed below as a two-level cache)
  - The L1 cache can be optimized for hit time and the L2 cache for reducing the miss rate
  - Global hit/miss rates vs. local hit/miss rates
  - Cache performance evaluation for a multilevel cache: memory stall cycles = stall cycles for L1 misses + stall cycles for L2 misses
  - The L2 miss penalty includes the hit time of the L2 cache
  - Ex: 5% L1 misses and 2% L2 misses per instruction (global miss rates), 20-cycle L2 access time, 100-cycle DRAM access time; two ways of calculating the stall cycles for 100 instructions:
    - 100 * (5% - 2%) * 20 + 100 * 2% * (100 + 20) = 60 + 240 = 300
    - 100 * 5% * 20 + 100 * 2% * 100 = 100 + 200 = 300
    - Note: the L2 cache must be checked before finding out that a DRAM access is necessary, thus for the misses that go to DRAM (L2 misses) the penalty is 120 instead of 100
    - Note: the L1 cache access time is usually counted in the ideal execution time, thus the L1 miss penalty is 20 cycles instead of 21
Virtual Memory
- Why virtual memory: a larger memory space for programs, protection, relocation, and the sharing of physical memory
- Page table organization: physical page number, virtual page number, valid bit
- Address format for a VM access: virtual page number + page offset
- Calculation of the page table size
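  - Ex (entry size assumed for illustration): with 32-bit virtual addresses, 4KB pages, and 4-byte page table entries, the page table has 2^32/2^12 = 2^20 entries and occupies 2^20 * 4B = 4MB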
- Page faults and their handling
- TLB: a cache for the page table
  - The importance of the TLB: without a TLB, every memory reference must go through the page table
  - What is the tag in a TLB? How to calculate the number of bits in the TLB tag?
  - Fully associative TLB vs. set-associative TLB
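    - Ex (page size chosen for illustration): with 32-bit virtual addresses and 4KB pages, the virtual page number is 20 bits; in a fully associative TLB the tag is the whole 20-bit virtual page number, while in a set-associative TLB with 2^k sets the tag is 20 - k bits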
- Putting VM and caches together
  - The procedure for going through the TLB, the page table, the caches, and DRAM
  - Calculating memory stall cycles in the presence of L1 misses, L2 misses, TLB misses, and page faults
Common Framework for memory hierarchy
- Four key issues: mapping memory blocks onto the cache, determining cache hits/misses, cache replacement, and write policy
- Cache miss classification: compulsory, capacity, and conflict
- Write policies
  - Write-through vs. write-back (on a cache write hit, whether or not the write goes through to DRAM)
  - Write-allocate vs. write-around (on a cache write miss, whether or not the DRAM block is loaded into the cache)
- Write policy complicates cache performance evaluation
  - We had assumed that writes behave the same as reads when calculating memory stall cycles, which is not correct
  - For a write-back cache, a read miss may cause the replacement of a dirty block; the miss penalty may then double because of the additional DRAM access needed to write the dirty block back to memory (depending on the specific design)
  - For a write-through cache, the write buffer may be full when a write hit happens, so even a write hit can cause a pipeline stall
  - For a write-allocate cache, a write miss may cause the same penalty as a read miss
  - For a write-around cache, a write miss may not cause a pipeline stall
- Study the “Big Example”, week 13, slide 10
I/O SYSTEMS
- Types and characteristics of I/O devices
- Interfacing I/O devices to the processor and memory
  - Polling, interrupt-driven I/O, and DMA
  - Calculating the processor overhead ratio: #IO-events-per-second * cost-per-event / #instructions-per-second
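    - Ex (hypothetical numbers): polling a device 1000 times per second at a cost of 400 instructions per poll, on a processor executing 10^8 instructions per second, gives an overhead of 1000 * 400 / 10^8 = 0.4%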
- Buses
  - Memory bus, backplane bus, and I/O bus
  - Bus arbitration
  - Determining bus transaction latency and effective data transfer rate
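    - Ex (hypothetical numbers): if a bus transaction takes 1 cycle for arbitration, 1 cycle for the address, and 4 cycles to transfer 16 bytes on a 100MHz bus, the effective transfer rate is 16B / (6 * 10ns) ≈ 267MB/s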
- Disk drive
  - Disk drive internal structure
  - Determining disk access latency and maximal I/O rate
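    - Ex (hypothetical numbers): for a 10,000 RPM disk with a 6ms average seek time and a 50MB/s transfer rate, reading a 4KB block takes about 6ms + 3ms (half a rotation) + 0.08ms ≈ 9.1ms, so the maximal rate is roughly 110 such I/Os per second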