CHITIN: A Comprehensive In-thread Instruction Replication Technique Against Transient Faults

Soft errors have become one of the most important design concerns due to drastic technology scaling. Software-based error detection techniques are attractive because of their flexibility and hardware independence. However, our in-depth analysis reveals that state-of-the-art techniques in this area cannot provide comprehensive fault coverage: i) their control-flow protection schemes provide incomplete redundancy of the original instructions, ii) they do not protect function calls and returns, and iii) their instruction scheduling leaves many vulnerabilities open. In this paper, we propose CHITIN, a set of code transformations for soft error resilience that combines the load-back checking scheme of nZDC, an improved SWIFT-like control-flow protection scheme, and contiguous scheduling of the original and redundant instructions to dramatically reduce vulnerability to soft errors that disrupt the control flow. Our fault injection experiments demonstrate that CHITIN eliminates more than 89% of the silent data corruptions that remain under state-of-the-art solutions.
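To make the ideas of in-thread replication and load-back checking more concrete, the following is a minimal Python sketch, not the authors' implementation. CHITIN works as a compiler-level transformation on machine instructions; this toy only mimics the pattern at a high level. The names `protected_add_store`, `SoftErrorDetected`, and the `fault` hook are hypothetical and introduced purely for illustration.

```python
class SoftErrorDetected(Exception):
    """Raised when the original and redundant copies disagree."""


def protected_add_store(memory, addr, a, b, fault=None):
    """Compute a + b twice, store the result, then load it back and check.

    `fault` is a hypothetical hook that corrupts the original result to
    mimic a transient fault; it is an assumption of this sketch.
    """
    # Redundant (shadow) copies of the operands.
    a_dup, b_dup = a, b

    # Original instruction followed immediately by its redundant copy,
    # loosely mirroring contiguous scheduling.
    result = a + b
    if fault is not None:
        result = fault(result)  # inject a bit flip into the original stream
    result_dup = a_dup + b_dup

    # Store the original value, load it back, and compare the loaded value
    # with the redundant one (the load-back checking idea).
    memory[addr] = result
    loaded = memory[addr]
    if loaded != result_dup:
        raise SoftErrorDetected(f"mismatch at address {addr}")
    return loaded


if __name__ == "__main__":
    mem = {}
    print(protected_add_store(mem, 0, 2, 3))  # fault-free run
    try:
        # Flip one bit of the original result before it is stored.
        protected_add_store(mem, 1, 2, 3, fault=lambda v: v ^ 4)
    except SoftErrorDetected as err:
        print("detected:", err)
```

In this sketch the check happens after the store, so a corruption of the stored value is caught when it is read back rather than passing silently into memory. Control-flow protection and the protection of calls and returns are not modeled here.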
