Hybrid cache architecture with disparate memory technologies

Caching techniques have been an efficient mechanism for mitigating the effects of the processor-memory speed gap. Traditional multi-level SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D chips or 3D stacked chips. Caches fabricated in these technologies offer dramatically different power and performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this paper, we propose to take advantage of the best characteristics that each technology offers, through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 7% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 25 workloads. A more aggressive RHCA-based design provides 12% IPC improvement over the baseline. Finally, a 2-layer 3D cache stack (3DHCA) of high density memory technology within the same chip footprint gives 18% IPC improvement over the baseline. Furthermore, up to 70% reduction in power consumption over a baseline SRAM-only design is achieved.

[1]  Martin Burtscher,et al.  Bridging the processor-memory performance gap with 3D IC technology , 2005, IEEE Design & Test of Computers.

[2]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[3]  Balaram Sinharoy,et al.  POWER5 system microarchitecture , 2005, IBM J. Res. Dev..

[4]  R. Morin,et al.  A multi-level comparative performance characterization of SPECjbb2005 versus SPECjbb2000 , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[5]  F. Pellizzer,et al.  Novel /spl mu/trench phase-change memory cell for embedded and stand-alone non-volatile memory applications , 2004, Digest of Technical Papers. 2004 Symposium on VLSI Technology, 2004..

[6]  Zeshan Chishti,et al.  Optimizing replication, communication, and capacity allocation in CMPs , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[7]  David A. Bader,et al.  BioPerf: a benchmark suite to evaluate high-performance computer architecture on bioinformatics applications , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[8]  Lei Jiang,et al.  Die Stacking (3D) Microarchitecture , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[9]  Jian Xu,et al.  Demystifying 3D ICs: the pros and cons of going vertical , 2005, IEEE Design & Test of Computers.

[10]  Yuan Xie,et al.  Design space exploration for 3D architectures , 2006, JETC.

[11]  Naoki Kitai,et al.  A 512kB Embedded Phase Change Memory with 416kB/s Write Throughput at 100μA Cell Write Current , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[12]  Yuan Xie,et al.  System-level cost analysis and design exploration for three-dimensional integrated circuits (3D ICs) , 2009, 2009 Asia and South Pacific Design Automation Conference.

[13]  Rajeev Balasubramonian,et al.  Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[14]  Chung Lam Cell Design Considerations for Phase Change Memory as a Universal Memory , 2008, 2008 International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA).

[15]  Yiran Chen,et al.  A novel architecture of the 3D stacked MRAM L2 cache for CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[16]  David Blaauw,et al.  Drowsy caches: simple techniques for reducing leakage power , 2002, ISCA.

[17]  Norman P. Jouppi,et al.  Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[18]  Yiran Chen,et al.  Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[19]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[20]  J. Joyner,et al.  Opportunities for reduced power dissipation using three-dimensional integration , 2002, Proceedings of the IEEE 2002 International Interconnect Technology Conference (Cat. No.02EX519).

[21]  Xiaoxia Wu,et al.  Power and performance of read-write aware Hybrid Caches with non-volatile memories , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[22]  Jaehyuk Huh,et al.  A NUCA Substrate for Flexible CMP Cache Sharing , 2007, IEEE Transactions on Parallel and Distributed Systems.

[23]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[24]  M. Hosomi,et al.  A novel nonvolatile memory with spin torque transfer magnetization switching: spin-ram , 2005, IEEE InternationalElectron Devices Meeting, 2005. IEDM Technical Digest..

[25]  E. Belhaire,et al.  Macro-model of Spin-Transfer Torque based Magnetic Tunnel Junction device for hybrid Magnetic-CMOS design , 2006, 2006 IEEE International Behavioral Modeling and Simulation Workshop.

[26]  Lixin Zhang,et al.  Mambo: a full system simulator for the PowerPC architecture , 2004, PERV.

[27]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[28]  David H. Bailey,et al.  The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[29]  Mahmut T. Kandemir,et al.  Design and Management of 3D Chip Multiprocessors Using Network-in-Memory , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).