Accurate Model for Application Failure Due to Transient Faults in Caches

To select an appropriate level of error protection in caches, the impact of various protection schemes on the cache Failure In Time (FIT) rate must be evaluated for a target benchmark suite. However, while many simulation tools exist to evaluate area, power, and performance for a set of benchmark programs, there is a dearth of such tools for reliability. This paper introduces PARMA+, a new cache reliability model with features that distinguish it from previous models. PARMA+ estimates a cache's FIT rate in the presence of single-bit faults, spatial multi-bit faults, and temporal multi-bit faults, under different error protection schemes including parity, ECC, early write-back, and bit-interleaving. We first develop the model formally and then demonstrate its accuracy: we run reliability simulations for many distributions of large and small fault patterns and compare the results with accelerated fault injection simulations. PARMA+ has high accuracy and low computational complexity.
