An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth

Memory bandwidth has become a major performance bottleneck as more and more cores are integrated onto a single die, demanding more and more data from the system memory. Several prior studies have demonstrated that this memory bandwidth problem can be addressed by employing a 3D-stacked memory architecture, which provides a wide, high frequency memory-bus interface. Although previous 3D proposals already provide as much bandwidth as a traditional L2 cache can consume, the dense through-silicon-vias (TSVs) of 3D chip stacks can provide still more bandwidth. In this paper, we contest that we need to re-architect our memory hierarchy, including the L2 cache and DRAM interface, so that it can take full advantage of this massive bandwidth. Our technique, SMART-3D, is a new 3D-stacked memory architecture with a vertical L2 fetch/write-back network using a large array of TSVs. Simply stated, we leverage the TSV bandwidth to hide latency behind very large data transfers. We analyze the design trade-offs for the DRAM arrays, careful enough to avoid compromising the DRAM density because of TSV placement. Moreover, we propose an efficient mechanism to manage the false sharing problem when implementing SMART-3D in a multi-socket system. For single-threaded memory-intensive applications, the SMART-3D architecture achieves speedups from 1.53 to 2.14 over planar designs and from 1.27 to 1.72 over prior 3D designs. We achieve similar speedups for multi-program and multi-threaded workloads on multi-core and multi-socket processors. Furthermore, SMART-3D can even lower the energy consumption in the L2 cache and 3D DRAM for it reduces the total number of row buffer misses.

[1]  Luca P. Carloni,et al.  Photonic NoCs: System-Level Design Exploration , 2009, IEEE Micro.

[2]  Kunle Olukotun,et al.  Niagara: a 32-way multithreaded Sparc processor , 2005, IEEE Micro.

[3]  Josep Torrellas,et al.  Share Data Placement Optimizations to Reduce Multiprocessor Cache Miss Rates , 1990, ICPP.

[4]  Norman P. Jouppi,et al.  WRL Research Report 93/5: An Enhanced Access and Cycle Time Model for On-chip Caches , 1994 .

[5]  Rajeev Balasubramonian,et al.  Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[6]  Jun Yang,et al.  A low-radix and low-diameter 3D interconnection network design , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[7]  Alan Jay Smith,et al.  Line (Block) Size Choice for CPU Cache Memories , 1987, IEEE Transactions on Computers.

[8]  Edward T. Grochowski,et al.  Larrabee: A many-Core x86 architecture for visual computing , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[9]  Alan Jay Smith,et al.  Sequential Program Prefetching in Memory Hierarchies , 1978, Computer.

[10]  Kathryn S. McKinley,et al.  Guided region prefetching: a cooperative hardware/software approach , 2003, ISCA '03.

[11]  J. ContiC.,et al.  Structural aspects of the system/360 model 85 , 1968 .

[12]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[13]  Sanjay J. Patel,et al.  Rigel: an architecture and scalable programming interface for a 1000-core accelerator , 2009, ISCA '09.

[14]  Hsien-Hsin S. Lee,et al.  Architectural evaluation of 3D stacked RRAM caches , 2009, 2009 IEEE International Conference on 3D System Integration.

[15]  James E. Smith,et al.  Data Cache Prefetching Using a Global History Buffer , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[16]  Jean-Loup Baer,et al.  Two techniques for improving performance on bus-based multiprocessors , 1995, Future Gener. Comput. Syst..

[17]  Per Stenström,et al.  TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors , 2002, ISLPED '02.

[18]  Martin Burtscher,et al.  Bridging the processor-memory performance gap with 3D IC technology , 2005, IEEE Design & Test of Computers.

[19]  Yu Cao,et al.  New paradigm of predictive MOSFET and interconnect modeling for early circuit simulation , 2000, Proceedings of the IEEE 2000 Custom Integrated Circuits Conference (Cat. No.00CH37044).

[20]  Olivier Temam,et al.  MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[21]  Alan L. Cox,et al.  Evaluation of release consistent software distributed shared memory on emerging network technology , 1993, ISCA '93.

[22]  K. Kavi Cache Memories Cache Memories in Uniprocessors. Reading versus Writing. Improving Performance , 2022 .

[23]  Tao Li,et al.  Microarchitecture soft error vulnerability characterization and mitigation under 3D integration technology , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[24]  Coniferous softwood GENERAL TERMS , 2003 .

[25]  Wei-Fen Lin,et al.  Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[26]  Xiaoxia Wu,et al.  Hybrid cache architecture with disparate memory technologies , 2009, ISCA '09.

[27]  Steven Przybylski The performance impact of block sizes and fetch strategies , 1990, ISCA '90.

[28]  Lei Jiang,et al.  Die Stacking (3D) Microarchitecture , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[29]  Andreas Moshovos RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[30]  Kaustav Banerjee,et al.  A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[31]  Livio Ricciulli,et al.  The detection and elimination of useless misses in multiprocessors , 1993, ISCA '93.

[32]  John S. Liptay,et al.  Structural Aspects of the System/360 Model 85 II: The Cache , 1968, IBM Syst. J..

[33]  Richard E. Matick,et al.  Logic-based eDRAM: Origins and rationale for use , 2005, IBM J. Res. Dev..

[34]  Mikko H. Lipasti,et al.  Stealth prefetching , 2006, ASPLOS XII.

[35]  Yvon Jégou,et al.  Using virtual lines to enhance locality exploitation , 1994, ICS '94.

[36]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[37]  Thomas F. Wenisch,et al.  Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[38]  Krisztián Flautner,et al.  PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor , 2006, ASPLOS XII.

[39]  Rajeev Balasubramonian,et al.  Leveraging 3D Technology for Improved Reliability , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[40]  D. Burger,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[41]  Wen-Hann Wang,et al.  Organization And Performance Of A Two-level Virtual-real Cache Hierarchy , 1989, The 16th Annual International Symposium on Computer Architecture.

[42]  Yiran Chen,et al.  A novel architecture of the 3D stacked MRAM L2 cache for CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[43]  Thomas J. LeBlanc,et al.  Adjustable block size coherent caches , 1992, ISCA '92.

[44]  Chita R. Das,et al.  A novel dimensionally-decomposed router for on-chip communication in 3D architectures , 2007, ISCA '07.

[45]  Hsien-Hsin S. Lee,et al.  POD: A 3D-Integrated Broad-Purpose Acceleration Layer , 2008, IEEE Micro.

[46]  Mahmut T. Kandemir,et al.  Design and Management of 3D Chip Multiprocessors Using Network-in-Memory , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[47]  Michel Dubois,et al.  Cache protocols with partial block invalidations , 1993, [1993] Proceedings Seventh International Parallel Processing Symposium.

[48]  Josep Torrellas,et al.  False Sharing ans Spatial Locality in Multiprocessor Caches , 1994, IEEE Trans. Computers.

[49]  W. H. Wang,et al.  Organization and performance of a two-level virtual-real cache hierarchy , 1989, ISCA '89.