MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets

Because a Single Event Upset (SEU) is increasingly likely to cause spatial Multi-Bit Upsets (MBUs), the effects of spatial MBUs have recently become an important and difficult issue, especially in large last-level caches (LLCs) guarded by protection codes. In the presence of spatial MBUs, the strength of the protection code becomes a critical design parameter. Developing a reliability model that captures the cumulative effects of overlapping Single-Bit Upsets (SBUs), temporal MBUs and spatial MBUs is hard, particularly when protection codes are active. In this paper, we introduce a new framework called MACAU. MACAU is based on a Markov chain model and can compute the intrinsic MTTFs of scrubbed caches as well as the MTTFs of caches running benchmarks, under various protection codes. MACAU is the first framework that quantifies the failure rates of caches due to the combined effects of SBUs, temporal MBUs and spatial MBUs.
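To illustrate how a mean time to failure can be read off a Markov chain, the following is a minimal sketch under strong simplifying assumptions; it is not MACAU's model. It assumes a hypothetical discrete-time chain for a single word protected by a single-error-correcting code, with two transient states (0 or 1 faulty bit) and one absorbing failure state (2 faulty bits). The per-step probabilities `p_upset` and `p_scrub` and all function names are illustrative and do not come from the paper. The expected number of steps before failure, t, solves (I - Q) t = 1, where Q holds the transition probabilities among the transient states.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            raise ValueError("singular system: failure state may be unreachable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def mean_steps_to_failure(q):
    """Expected steps to absorption from each transient state.

    q[i][j] is the one-step probability of moving from transient state i
    to transient state j; the missing row mass is the failure probability.
    """
    n = len(q)
    i_minus_q = [[(1.0 if i == j else 0.0) - q[i][j] for j in range(n)]
                 for i in range(n)]
    return solve_linear(i_minus_q, [1.0] * n)


def toy_word_chain(p_upset, p_scrub):
    """Transient block Q of the hypothetical two-state word model (assumption).

    State 0: no faulty bit. State 1: one faulty bit (still correctable).
    In state 1, a scrub (probability p_scrub) returns the word to state 0;
    otherwise a second upset (probability p_upset) causes failure.
    """
    return [
        [1.0 - p_upset, p_upset],
        [p_scrub, (1.0 - p_scrub) * (1.0 - p_upset)],
    ]


if __name__ == "__main__":
    for p_scrub in (0.0, 0.5):
        steps = mean_steps_to_failure(toy_word_chain(0.5, p_scrub))
        print(f"p_scrub={p_scrub}: mean steps to failure from clean state = {steps[0]:.2f}")
```

In this toy chain, raising `p_scrub` lengthens the expected time to failure, which is the qualitative effect of scrubbing. A realistic model would need many more states and upset patterns that cover spatial MBUs.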
