Cached DRAM for ILP Processor Memory Access Latency Reduction 

Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang

IEEE Micro, Vol. 21, No. 4, July/August, 2001, pp. 22-32.

Abstract

As the speed gap between the processor and the memory continues to
widen, data-intensive applications are putting increasing demands on
the main memory system.  Cached DRAM is an existing technology that
adds a small cache onto the DRAM chip.  By exploiting the locality of
memory access streams missing the L2 cache, a cached DRAM can reduce
the average DRAM access time.  Previous studies have shown that cached
DRAM is effective on a relatively simple processor model with small
data caches or none at all. Some recent studies have shown that
this technique can be effective on modern ILP processors as well.
To further investigate the ILP effects and to compare cached
DRAM with other advanced DRAM organizations and interleaving techniques,
we present a study of its design and optimization in the context of
processors with full ILP capabilities and large data caches. Using
execution-driven simulation, we have evaluated its performance with
eight selected data-intensive SPECfp95 programs and the TPC-C workload.
Our study provides three new findings: (1) cached DRAM consistently
shows its performance advantage as the ILP degree increases;
(2) contemporary DRAM schemes, such as SDRAM, Enhanced SDRAM,
Rambus DRAM, and Direct Rambus DRAM, do not exploit memory access
locality of data-intensive workloads as effectively as a cached DRAM
does; and (3) compared with a highly effective permutation-based DRAM
interleaving technique, cached DRAM can still gain substantial performance
improvement because it fully utilizes the bus bandwidth by overlapping
a large number of concurrent memory accesses, and minimizes conflict
misses in the on-memory caches and/or row-buffers.
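
The locality argument in the abstract can be illustrated with a toy model. The sketch below is not the simulator or the cache organization used in the paper: it assumes a single DRAM bank, compares one row buffer against a small fully associative LRU on-chip cache, and uses made-up latencies (10 cycles on a hit, 50 on a miss). The function names, the cache size, and the access stream are all hypothetical.

```python
from collections import OrderedDict


def row_buffer_hits(rows):
    """Count hits when the bank keeps only the most recently opened row."""
    open_row = None
    hits = 0
    for row in rows:
        if row == open_row:
            hits += 1
        open_row = row  # the newly accessed row replaces the old one
    return hits


def cached_dram_hits(rows, num_lines):
    """Count hits in an assumed fully associative LRU cache of num_lines blocks."""
    cache = OrderedDict()
    hits = 0
    for row in rows:
        if row in cache:
            hits += 1
            cache.move_to_end(row)  # mark as most recently used
        else:
            cache[row] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used block
    return hits


def average_latency(hits, total, hit_cycles, miss_cycles):
    """Average access time for a given hit count (latencies are assumptions)."""
    return (hits * hit_cycles + (total - hits) * miss_cycles) / total


# Hypothetical stream: two rows of the same bank accessed alternately,
# which makes a single row buffer miss on every access.
stream = [0, 1] * 8

rb_hits = row_buffer_hits(stream)
cd_hits = cached_dram_hits(stream, num_lines=4)
rb_latency = average_latency(rb_hits, len(stream), hit_cycles=10, miss_cycles=50)
cd_latency = average_latency(cd_hits, len(stream), hit_cycles=10, miss_cycles=50)
```

In this toy stream the single row buffer never hits, while the four-entry cache misses only on the first touch of each row. Real workloads and real cached DRAM designs are more complex, so the numbers here show only the direction of the effect, not its size.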