Protecting Caches from Soft Errors

Soft errors are one of the most important design concerns in modern embedded systems under aggressive technology scaling. Among the microarchitectural components of a processor, the cache is the most susceptible to soft errors. Error detection and correction codes are common protection techniques for cache memory because of their design simplicity. To design effective protection techniques for caches, it is important to quantitatively estimate cache susceptibility both without and with protection. At the architectural level, vulnerability is the metric that quantifies the susceptibility of data in caches. However, existing tools and techniques calculate the vulnerability of cached data through coarse-grained block-level estimation. Further, they ignore common cache protection techniques such as error detection and correction codes. In this article, we demonstrate through intensive fault injection campaigns that our word-level vulnerability estimation is more accurate than block-level estimation. Further, our extensive experiments over benchmark suites reveal several counterintuitive and interesting results. Parity checking performed only on reads provides more reliable and power-efficient protection than parity checking performed on both reads and writes. On the other hand, checking error-correcting codes only on reads can leave the cache vulnerable even to single-bit soft errors, while checking them on both reads and writes provides perfect reliability.
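The gap between block-level and word-level estimation can be illustrated with a toy accounting model. The sketch below is not the estimation tool described in the article. It assumes a simplified rule, introduced here only for illustration, in which a unit of data is vulnerable during any interval that ends in a read, and an interval that ends in a write is not vulnerable because the write overwrites a flipped bit before it is consumed. The block size, the trace format, and all function names are hypothetical.

```python
WORDS_PER_BLOCK = 4  # assumption: 4 words per cache block


def vulnerable_cycles(trace, unit_of):
    """Sum the intervals that end in a read, tracked per unit.

    trace   -- list of (cycle, op, word_index) with op in {"R", "W"}
    unit_of -- maps a word index to the tracked unit (a word or a block)
    """
    last_access = {}
    total = 0
    # Process accesses in cycle order; ties keep their original order.
    for cycle, op, word in sorted(trace, key=lambda entry: entry[0]):
        unit = unit_of(word)
        if op == "R" and unit in last_access:
            total += cycle - last_access[unit]
        last_access[unit] = cycle
    return total


def word_level(trace):
    """Vulnerability in word-cycles, tracking every word separately."""
    return vulnerable_cycles(trace, lambda w: w)


def block_level(trace):
    """Vulnerability in word-cycles, treating a whole block as one unit."""
    per_block = vulnerable_cycles(trace, lambda w: w // WORDS_PER_BLOCK)
    return WORDS_PER_BLOCK * per_block


# Case A: four words are written, but only word 0 is ever read.
trace_a = [(0, "W", 0), (0, "W", 1), (0, "W", 2), (0, "W", 3), (10, "R", 0)]

# Case B: a write to word 1 resets the block's interval before word 0 is read.
trace_b = [(0, "W", 0), (8, "W", 1), (10, "R", 0)]

print(word_level(trace_a), block_level(trace_a))  # 10 40
print(word_level(trace_b), block_level(trace_b))  # 10 8
```

Under this toy rule, block-level accounting charges all four words in case A even though only one is read, and in case B it misses most of word 0's exposure because an unrelated write to the same block restarts the interval. The numbers are artifacts of the assumed rule and the hand-made traces, not results from the article.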
