BACH: A Bandwidth-Aware Hybrid Cache Hierarchy Design with Nonvolatile Memories

Limited main memory bandwidth is becoming a fundamental performance bottleneck in chipmultiprocessor (CMP) design. Yet directly increasing the peak memory bandwidth can incur high cost and power consumption. In this paper, we address this problem by proposing a memory, a bandwidth-aware reconfigurable cache hierarchy, BACH, with hybrid memory technologies. Components of our BACH design include a hybrid cache hierarchy, a reconfiguration mechanism, and a statistical prediction engine. Our hybrid cache hierarchy chooses different memory technologies with various bandwidth characteristics, such as spin-transfer torque memory (STT-MRAM), resistive memory (ReRAM), and embedded DRAM (eDRAM), to configure each level so that the peak bandwidth of the overall cache hierarchy is optimized. Our reconfiguration mechanism can dynamically adjust the cache capacity of each level based on the predicted bandwidth demands of running workloads. The bandwidth prediction is performed by our prediction engine. We evaluate the system performance gain obtained by BACH design with a set of multithreaded and multiprogrammed workloads with and without the limitation of system power budget. Compared with traditional SRAM-based cache design, BACH improves the system throughput by 58% and 14% with multithreaded and multiprogrammed workloads respectively.

[1]  M. Hosomi,et al.  A novel nonvolatile memory with spin torque transfer magnetization switching: spin-ram , 2005, IEEE InternationalElectron Devices Meeting, 2005. IEDM Technical Digest..

[2]  Xiaoxia Wu,et al.  Exploration of 3D stacked L2 cache design for high performance and efficient thermal control , 2009, ISLPED.

[3]  N. Gura,et al.  UltraSPARC T2: A highly-treaded, power-efficient, SPARC SOC , 2007, 2007 IEEE Asian Solid-State Circuits Conference.

[4]  Yuan Xie,et al.  Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[5]  Vijayalakshmi Srinivasan,et al.  Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Yiran Chen,et al.  A novel architecture of the 3D stacked MRAM L2 cache for CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[7]  Kinam Kim,et al.  Highly manufacturable high density phase change memory of 64Mb and beyond , 2004, IEDM Technical Digest. IEEE International Electron Devices Meeting, 2004..

[8]  Luca Benini,et al.  An efficient distributed memory interface for many-core platform with 3D stacked DRAM , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[9]  N. Shimomura,et al.  Impact of ultra low power and fast write operation of advanced perpendicular MTJ on power reduction for high-performance mobile CPU , 2012, 2012 International Electron Devices Meeting.

[10]  Tao Zhang,et al.  MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[11]  Ruhi Sarikaya,et al.  Predicting Program Behavior Based On Objective Function Minimization , 2007, 2007 IEEE 10th International Symposium on Workload Characterization.

[12]  Engin Ipek,et al.  Dynamically replicated memory: building reliable systems from nanoscale resistive memories , 2010, ASPLOS XV.

[13]  Stanley F. Chen,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[14]  Kinam Kim,et al.  Bi-layered RRAM with unlimited endurance and extremely uniform switching , 2011, 2011 Symposium on VLSI Technology - Digest of Technical Papers.

[15]  Krisztián Flautner,et al.  PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor , 2006, ASPLOS XII.

[16]  Ruhi Sarikaya,et al.  Runtime workload behavior prediction using statistical metric modeling with application to dynamic power management , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[17]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[18]  Young-Hyun Jun,et al.  A 1.2 V 12.8 GB/s 2 Gb Mobile Wide-I/O DRAM With 4 $\times$ 128 I/Os Using TSV Based Stacking , 2011, IEEE Journal of Solid-State Circuits.

[19]  Frederick T. Chen,et al.  Evidence and solution of over-RESET problem for HfOX based resistive memory with sub-ns switching speed and high endurance , 2010, 2010 International Electron Devices Meeting.

[20]  Brian Rogers,et al.  Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.

[21]  Cong Xu,et al.  Moguls: A model to explore the memory hierarchy for bandwidth improvements , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[22]  Yuan Xie,et al.  Design space exploration for 3D architectures , 2006, JETC.

[23]  F ChenStanley,et al.  An Empirical Study of Smoothing Techniques for Language Modeling , 1996, ACL.

[24]  Siddharth Gaba,et al.  Nanoscale resistive memory with intrinsic diode characteristics and long endurance , 2010 .

[25]  Jaehyuk Huh,et al.  Exploring the design space of future CMPs , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[26]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[27]  Sandhya Dwarkadas,et al.  Characterizing and predicting program behavior and its variability , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[28]  Sanjeev Kumar,et al.  Dynamic tracking of page miss ratio curve for memory management , 2004, ASPLOS XI.

[29]  Chenjie Yu,et al.  Off-chip memory bandwidth minimization through cache partitioning for multi-core platforms , 2010, Design Automation Conference.

[30]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[31]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[32]  Karin Strauss,et al.  Use ECP, not ECC, for hard failures in resistive memories , 2010, ISCA.

[33]  M. Facchini,et al.  Stackable memory of 3D chip integration for mobile applications , 2008, 2008 IEEE International Electron Devices Meeting.

[34]  James R. Goodman,et al.  Memory Bandwidth Limitations of Future Microprocessors , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[35]  H. Noguchi,et al.  Progress of STT-MRAM technology and the effect on normally-off computing systems , 2012, 2012 International Electron Devices Meeting.

[36]  Christian Bienia,et al.  Benchmarking modern multiprocessors , 2011 .

[37]  Kaustav Banerjee,et al.  A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[38]  Norman P. Jouppi,et al.  FREE-p: Protecting non-volatile memory against both hard and soft errors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[39]  Hsien-Hsin S. Lee,et al.  SAFER: Stuck-At-Fault Error Recovery for Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[40]  Luan Tran,et al.  45nm low power CMOS logic compatible embedded STT MRAM utilizing a reverse-connection 1T/1MTJ cell , 2009, 2009 IEEE International Electron Devices Meeting (IEDM).

[41]  Martin Burtscher,et al.  Bridging the processor-memory performance gap with 3D IC technology , 2005, IEEE Design & Test of Computers.

[42]  Hsien-Hsin S. Lee,et al.  An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[43]  L. Goux,et al.  Ultralow sub-500nA operating current high-performance TiN\Al2O3\HfO2\Hf\TiN bipolar RRAM achieved through understanding-based stack-engineering , 2012, 2012 Symposium on VLSI Technology (VLSIT).

[44]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[45]  Yan Solihin,et al.  Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[46]  Shih-Hung Chen,et al.  Phase-change random access memory: A scalable technology , 2008, IBM J. Res. Dev..

[47]  Young-Hyun Jun,et al.  A 1.2V 12.8GB/s 2Gb mobile Wide-I/O DRAM with 4×128 I/Os using TSV-based stacking , 2011, 2011 IEEE International Solid-State Circuits Conference.

[48]  H. Noguchi,et al.  Novel hybrid DRAM/MRAM design for reducing power of high performance mobile CPU , 2012, 2012 International Electron Devices Meeting.

[49]  Sally A. McKee,et al.  Reflections on the memory wall , 2004, CF '04.

[50]  Cong Xu,et al.  NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory , 2012, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[51]  Onur Mutlu,et al.  Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management , 2012, IEEE Computer Architecture Letters.

[52]  Hsien-Hsin S. Lee,et al.  Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping , 2010, ISCA.

[53]  E. Belhaire,et al.  Macro-model of Spin-Transfer Torque based Magnetic Tunnel Junction device for hybrid Magnetic-CMOS design , 2006, 2006 IEEE International Behavioral Modeling and Simulation Workshop.

[54]  Yuan Xie,et al.  Cost-aware three-dimensional (3D) many-core multiprocessor design , 2010, Design Automation Conference.

[55]  K. Saban Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity , Bandwidth , and Power Efficiency , 2009 .

[56]  Guangyu Sun Moguls: A Model to Explore the Memory Hierarchy for Throughput Computing , 2014 .

[57]  L. Goux,et al.  Dynamic ‘hour glass’ model for SET and RESET in HfO2 RRAM , 2012, 2012 Symposium on VLSI Technology (VLSIT).

[58]  Patrick Dorsey Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity, Bandwidth, and Power Efficiency , 2010 .

[59]  T. Mudge,et al.  Drowsy caches: simple techniques for reducing leakage power , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[60]  Xiaoxia Wu,et al.  Hybrid cache architecture with disparate memory technologies , 2009, ISCA '09.

[61]  Norman P. Jouppi,et al.  Reconfigurable caches and their application to media processing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).