Design and optimization of large size and low overhead off-chip caches

Large off-chip L3 caches can significantly improve the performance of memory-intensive applications. However, conventional L3 SRAM caches are facing two issues as those applications require increasingly large caches. First, an SRAM cache has a limited size due to the low density and high cost of SRAM and, thus, cannot hold the working sets of many memory-intensive applications. Second, since the tag checking overhead of large caches is nontrivial, the existence of L3 caches increases the cache miss penalty and may even harm the performance of some memory-intensive applications. To address these two issues, we present a new memory hierarchy design that uses cached DRAM to construct a large size and low overhead off-chip cache. The high density DRAM portion in the cached DRAM can hold large working sets, while the small SRAM portion exploits the spatial locality appearing in L2 miss streams to reduce the access latency. The L3 tag array is placed off-chip with the data array, minimizing the area overhead on the processor for L3 cache, while a small tag cache is placed on-chip, effectively removing the off-chip tag access overhead. A prediction technique accurately predicts the hit/miss status of an access to the cached DRAM, further reducing the access latency. Conducting execution-driven simulations for a 2 GHz 4-way issue processor and with 11 memory-intensive programs from the SPEC 2000 benchmark, we show that a system with a cached DRAM of 64 MB DRAM and 128 KB on-chip SRAM cache as the off-chip cache outperforms the same system with an 8 MB SRAM L3 off-chip cache by up to 78 percent measured by the total execution time. The average speedup of the system with the cached-DRAM off-chip cache is 25 percent over the system with the L3 SRAM cache.

[1]  James E. Smith,et al.  Performance Of Cached Dram Organizations In Vector Supercomputers , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[2]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[3]  Christoforos Kozyrakis,et al.  A Media-Enhanced Vector Architecture for Embedded Memory Systems , 1999 .

[4]  Jignesh M. Patel,et al.  Data prefetching by dependence graph precomputation , 2001, ISCA 2001.

[5]  Gary S. Tyson,et al.  A modified approach to data cache management , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[6]  James R. Goodman,et al.  Instruction Cache Replacement Policies and Organizations , 1985, IEEE Transactions on Computers.

[7]  Michael Shantz,et al.  Multi-level texture caching for 3D graphics hardware , 1998, Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235).

[8]  Chun Chen,et al.  The architecture of the DIVA processing-in-memory chip , 2002, ICS '02.

[9]  Zhao Zhang,et al.  Fine-grain priority scheduling on multi-channel memory systems , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[10]  Doug Burger System-Level Implications of Processor-Memory Integration , 1997 .

[11]  Zarka Cvetanovic,et al.  AlphaServer 4100 Performance Characterization , 1996, Digit. Tech. J..

[12]  Anoop Gupta,et al.  The Design and Analysis of a Cache Architecture for Texture Mapping , 1997, ISCA.

[13]  Lance Hammond,et al.  A Single Chip Multiprocessor Integrated with High Density DRAM , 1997 .

[14]  Michael E. Wazlowski,et al.  Pinnacle: IBM MXT in a Memory Controller Chip , 2001, IEEE Micro.

[15]  Kenneth M. Wilson,et al.  Designing High Bandwidth On-chip Caches , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[16]  Graham Kirsch Active memory: Micron's Yukon , 2003, Proceedings International Parallel and Distributed Processing Symposium.

[17]  R. Norwood,et al.  Memory-a new era of fast dynamic RAMs (for video applications) , 1992, IEEE Spectrum.

[18]  Richard E. Kessler,et al.  Evaluating stream buffers as a secondary cache replacement , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[19]  Xiaowei Shen,et al.  Performance of hardware compressed main memory , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[20]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[21]  A. Seznec,et al.  Decoupled sectored caches: conciliating low tag implementation cost and low miss ratio , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[22]  Wei-Fen Lin,et al.  Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[23]  André Seznec,et al.  Decoupled sectored caches: conciliating low tag implementation cost , 1994, ISCA '94.

[24]  Norman P. Jouppi,et al.  Cacti 3. 0: an integrated cache timing, power, and area model , 2001 .

[25]  Wen-Hann Wang,et al.  On the inclusion properties for multi-level cache hierarchies , 1988, ISCA '88.

[26]  Qing Yang,et al.  DCD --- Disk Caching Disk: A New Approach for Boosting I/O Performance , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[27]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[28]  Wen-Hann Wang,et al.  On the Inclusion Properties for Multi-Level Cache Hierarchies , 1988, ISCA.

[29]  Nancy Warter-Perez,et al.  Modulo scheduling with multiple initiation intervals , 1995, MICRO 1995.

[30]  Stephen Richardson,et al.  An Equal Area Comparison of Embedded DRAM and SRAM Memory Architectures for a Chip Multiprocessor , 2000 .

[31]  Zhao Zhang,et al.  Cached DRAM for ILP Processor Memory Access Latency Reduction , 2001, IEEE Micro.

[32]  Jean-Loup Baer,et al.  Modified LRU policies for improving second-level cache behavior , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[33]  Hideto Hidaka,et al.  The cache DRAM architecture: a DRAM with an on-chip cache memory , 1990, IEEE Micro.

[34]  Rajeev Balasubramonian,et al.  Dynamically allocating processor resources between nearby and distant ILP , 2001, ISCA 2001.

[35]  James R. Goodman,et al.  A study of instruction cache organizations and replacement policies , 1983, ISCA '83.

[36]  Fong Pong,et al.  Missing the Memory Wall: The Case for Processor/Memory Integration , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[37]  Chi-Keung Luk,et al.  Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[38]  Trevor N. Mudge,et al.  A performance comparison of contemporary DRAM architectures , 1999, ISCA.

[39]  Gary S. Tyson,et al.  Eager writeback-a technique for improving bandwidth utilization , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[40]  Stéphan Jourdan,et al.  Speculation techniques for improving load related instruction scheduling , 1999, ISCA.

[41]  Leonid Oliker,et al.  Memory-intensive benchmarks: IRAM vs. cache-based machines , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[42]  Yiming Hu,et al.  DCD—disk caching disk: a new approach for boosting I/O performance , 1996, ISCA '96.

[43]  Alan Jay Smith,et al.  Functional Implementation Techniques for CPU Cache Memories , 1999, IEEE Trans. Computers.

[44]  Anoop Gupta,et al.  Parallel computer architecture - a hardware / software approach , 1998 .

[45]  Charles A. Hart CDRAM in a unified memory architecture , 1994, Proceedings of COMPCON '94.

[46]  Brad Calder,et al.  A Decoupled Predictor-Directed Stream Prefetching Architecture , 2003, IEEE Trans. Computers.

[47]  Gershon Kedem,et al.  WCDRAM: A fully associative integrated Cached-DRAM with wide cache lines , 1997 .

[48]  Naveen Cherukuri,et al.  The IA-64 Itanium Processor Cartridge , 2001, IEEE Micro.