DRAM-Level Prefetching for Fully-Buffered DIMM: Design, Performance and Power Saving

We study DRAM-level prefetching for the fully buffered DIMM (FB-DIMM), a memory architecture designed for multi-core processors. FB-DIMM has a unique two-level interconnect structure: FB-DIMM channels at the first level connect the memory controller to the advanced memory buffers (AMBs), and DDR2 buses at the second level connect the AMBs to the DRAM chips. We propose an AMB prefetching method that prefetches memory blocks from the DRAM chips into the AMBs. It uses the redundant bandwidth between the DRAM chips and the AMBs and does not consume the critical channel bandwidth. The method fetches K memory blocks, each the size of an L2 cache block, around the demanded block, where K is a small value ranging from two to eight. It may also reduce DRAM power consumption by merging some DRAM precharges and activations. Our cycle-accurate simulation shows an average performance improvement of 16% for single-core and multi-core workloads constructed from memory-intensive SPEC2000 programs with software cache prefetching enabled, and no workload shows a negative speedup. The performance gain comes from the reduction of idle memory latency and the improvement of channel bandwidth utilization. There is only a small overlap between the performance gains from AMB prefetching and those from software cache prefetching. The estimated power saving averages 15%.
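As an illustration of fetching K blocks around the demanded block, the following is a minimal sketch and not the paper's implementation. It assumes the K blocks form a K-aligned group of consecutive blocks that contains the demanded block, and that the demanded block counts toward K; the abstract does not specify either detail. The function name `amb_prefetch_group` and the 64-byte default block size are hypothetical.

```python
def amb_prefetch_group(demand_addr, k=4, block_size=64):
    """Return (demand_block_addr, prefetch_block_addrs) for one demand access.

    Assumption (not stated in the abstract): the K blocks are the K-aligned
    group of consecutive blocks containing the demanded block, and the
    demanded block itself is one of the K.
    """
    if k < 1 or block_size < 1 or demand_addr < 0:
        raise ValueError("k and block_size must be positive, address non-negative")

    block_index = demand_addr // block_size   # index of the demanded block
    group_start = (block_index // k) * k      # first block of the aligned group
    demand_block_addr = block_index * block_size

    # Every other block in the group is a prefetch candidate for the AMB.
    prefetch_block_addrs = [
        b * block_size
        for b in range(group_start, group_start + k)
        if b != block_index
    ]
    return demand_block_addr, prefetch_block_addrs


if __name__ == "__main__":
    demand, prefetch = amb_prefetch_group(0x1234, k=4, block_size=64)
    print(hex(demand), [hex(a) for a in prefetch])
```

In this sketch only the demanded block would travel over the FB-DIMM channel; the remaining K-1 blocks stay in the AMB, which is where the channel bandwidth saving described above would come from.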
