PREMSim: A Resilience Framework for Modeling Traditional and Emerging Memory Reliability

Scaling limitations of conventional and emerging memories have driven an increased focus on reliability techniques that overcome the physical limitations of imperfect devices. Despite these advances, critical challenges remain as new memory types and new memory vulnerabilities arise. No existing simulator offers extended reliability models that allow easy comparison of existing and newly developed techniques, or simple integration of new reliability concepts and failure modes. Our simulator, PremSim, provides a framework that addresses these limitations. While PremSim can run from memory traces, it was also designed from the ground up to integrate fully, as a memory backend, with several external simulators, including the Structural Simulation Toolkit (SST). It can also connect to other detailed memory backends, such as DRAMSim2, for detailed energy and timing. Further, it provides modes that estimate lifetime for endurance-limited memories and report the provable correction capability per row for a given fault distribution. To perform these calculations in a reasonable time and to remain compatible with abstract full-system simulators, we also provide and verify novel abstractions. Additionally, we present case studies showing how different fault mitigation strategies can be modeled effectively in PremSim, including solutions at the page, row, word, and bit levels of granularity for next-generation traditional and emerging faulty memories.
