Concurrency, latency, or system overhead: which has the largest impact on uniprocessor DRAM-system performance?

Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for the organization of a DRAM system. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and organizations, turnaround overhead, the memory-controller page protocol, and the algorithms for assigning request priorities and scheduling requests dynamically. Across this design space, we see wide variation in application execution times: for example, execution times for the SPEC CPU 2000 integer suite on a 2-way ganged Direct Rambus organization (32 data bits) with 64-byte bursts are 10-20% lower than execution times on an otherwise identical configuration that uses 32-byte bursts. These two configurations are relatively close to each other in the design space; performance differences become even more pronounced for designs that are further apart. This paper characterizes the sources of overhead in high-performance DRAM systems and investigates the most effective ways to reduce a system's exposure to performance loss. In particular, we look at mechanisms to increase a system's support for concurrent transactions, mechanisms to reduce request latency, and mechanisms to reduce the “system overhead”: the portion of the primary memory system's overhead that is due not to DRAM latency but to factors such as turnaround time, request queueing, and inefficiencies caused by interleaving read and write requests. Our simulator models a 2GHz, highly aggressive out-of-order uniprocessor. The interface to the memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches, with split-transaction buses to all DRAM banks.
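
As a small illustration of the burst-size parameter, the following Python sketch counts how many data-bus transfers one burst occupies on the 32-bit configuration mentioned above. The function name `transfers_per_burst` is hypothetical and not taken from the paper's simulator; the sketch counts transfers only and does not model timing, queueing, or the measured 10-20% difference.

```python
def transfers_per_burst(burst_bytes: int, data_bits: int) -> int:
    """Return the number of data-bus transfers needed to move one burst.

    Hypothetical helper for illustration: it divides the burst size by the
    width of the data path and ignores all timing effects.
    """
    bytes_per_transfer, leftover_bits = divmod(data_bits, 8)
    if data_bits <= 0 or leftover_bits:
        raise ValueError("data_bits must be a positive multiple of 8")
    if burst_bytes <= 0 or burst_bytes % bytes_per_transfer:
        raise ValueError("burst_bytes must be a positive multiple of the bus width")
    return burst_bytes // bytes_per_transfer


# The two configurations compared in the abstract share a 32-bit data path
# and differ only in burst size.
DATA_BITS = 32
for burst_bytes in (32, 64):
    count = transfers_per_burst(burst_bytes, DATA_BITS)
    print(f"{burst_bytes}-byte burst on {DATA_BITS} data bits: {count} transfers")
```

Doubling the burst size doubles the number of transfers per request while the per-request overheads (such as turnaround) are paid once, which is one plausible reason the burst size can matter; the paper's own measurements, not this arithmetic, establish the actual effect.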
