nZDC: A compiler technique for near Zero Silent Data Corruption

The exponentially growing rate of soft errors makes reliability a major concern in modern processor design. Because software-oriented approaches offer flexible protection even on off-the-shelf processors, they are attractive solutions for protecting against soft errors. Among them, in-application instruction-duplication approaches have been widely used and are deemed the most effective. Such techniques duplicate the program's assembly instructions and periodically check the results to identify possible errors. Even though early reports suggest that these techniques achieve close to 100% protection from soft errors, we find several gaps in the protection. Existing techniques are unable to protect several important microarchitectural components, as well as a significant fraction of instructions, which results in Silent Data Corruptions (SDCs). This paper presents nZDC (near Zero silent Data Corruption), an effective instruction-duplication approach for protecting programs from soft errors. Extensive fault injection experiments on almost all the unprotected microarchitectural components of a simulated ARM Cortex-A53, executing benchmarks from the MiBench suite, demonstrate that nZDC is extremely effective without incurring any more performance penalty than the state of the art.
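The general idea of instruction duplication described above (compute everything twice, then compare before a value leaves the processor) can be illustrated with a toy interpreter. This is only a minimal sketch of the generic scheme, not the nZDC transformation itself: the three-instruction mini-ISA, the names `run_duplicated` and `FaultDetected`, the `flip` fault-injection parameter, and the choice to check only before stores are all assumptions made for illustration.

```python
from typing import Optional, Tuple


class FaultDetected(Exception):
    """Raised when a master register and its shadow copy disagree."""


def run_duplicated(program, flip: Optional[Tuple[int, str, int]] = None):
    """Run a toy program with every computation duplicated.

    Hypothetical mini-ISA (an assumption, not from the paper):
        ("li", rd, imm)       rd = imm
        ("add", rd, ra, rb)   rd = ra + rb
        ("store", addr, rs)   memory[addr] = rs

    flip = (step, reg, bit) flips one bit of a master register right
    after the given step, simulating a single soft error.
    """
    master, shadow, memory = {}, {}, {}
    for step, ins in enumerate(program):
        op = ins[0]
        if op == "li":
            _, rd, imm = ins
            master[rd] = imm
            shadow[rd] = imm  # duplicated instruction
        elif op == "add":
            _, rd, ra, rb = ins
            master[rd] = master[ra] + master[rb]
            shadow[rd] = shadow[ra] + shadow[rb]  # duplicated instruction
        elif op == "store":
            _, addr, rs = ins
            # Check before the value becomes visible in memory.
            if master[rs] != shadow[rs]:
                raise FaultDetected(f"mismatch in {rs} at step {step}")
            memory[addr] = master[rs]
        else:
            raise ValueError(f"unknown opcode: {op}")
        if flip is not None and flip[0] == step:
            _, reg, bit = flip
            master[reg] ^= 1 << bit  # injected bit flip
    return memory


program = [
    ("li", "r1", 2),
    ("li", "r2", 3),
    ("add", "r3", "r1", "r2"),
    ("store", 0, "r3"),
]

print(run_duplicated(program))  # fault-free run
try:
    run_duplicated(program, flip=(2, "r3", 4))  # corrupt r3 after the add
except FaultDetected as err:
    print("detected:", err)
```

In this sketch a flipped register bit is caught at the store because the shadow copy still holds the correct value. Faults that strike outside the duplicated state, for example after the check has passed, would go unnoticed, which is the kind of protection gap the abstract refers to.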
