Detecting Silent Data Corruptions in Aerospace-Based Computing Using Program Invariants

Soft error caused by single event upset has been a severe challenge to aerospace-based computing. Silent data corruption (SDC) is one of the results incurred by soft error. SDC occurs when a program generates erroneous output with no indications. SDC is the most insidious type of results and very difficult to detect. To address this problem, we design and implement an invariant-based system called Radish. Invariants describe certain properties of a program; for example, the value of a variable equals a constant. Radish first extracts invariants at key program points and converts invariants into assertions. It then hardens the program by inserting the assertions into the source code. When a soft error occurs, assertions will be found to be false at run time and warn the users of soft error. To increase the coverage of SDC, we further propose an extension of Radish, named Radish_D, which applies software-based instruction duplication mechanism to protect the uncovered code sections. Experiments using architectural fault injections show that Radish achieves high SDC coverage with very low overhead. Furthermore, Radish_D provides higher SDC coverage than that of either Radish or pure instruction duplication.

[1]  Ravishankar K. Iyer,et al.  Dynamic Derivation of Application-Specific Error Detectors and their Implementation in Hardware , 2006, 2006 Sixth European Dependable Computing Conference.

[2]  D. Binder,et al.  Satellite Anomalies from Galactic Cosmic Rays , 1975, IEEE Transactions on Nuclear Science.

[3]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.

[4]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[5]  Jeffrey S. Vetter,et al.  A Survey of Techniques for Modeling and Improving Reliability of Computing Systems , 2016, IEEE Transactions on Parallel and Distributed Systems.

[6]  Trevor Mudge,et al.  MiBench: A free, commercially representative embedded benchmark suite , 2001 .

[7]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS 2010.

[8]  Shubhendu S. Mukherjee,et al.  Perturbation-based Fault Screening , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[9]  Muhammad Shafique,et al.  Exploiting program-level masking and error propagation for constrained reliability optimization , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[10]  Karthik Pattabiraman,et al.  LLFI : An Intermediate Code Level Fault Injector For Soft Computing Applications , 2013 .

[11]  Thiago Santini,et al.  Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units , 2016, IEEE Transactions on Computers.

[12]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[13]  J. Olsen,et al.  Neutron-induced single event upsets in static RAMS observed a 10 km flight attitude , 1993 .

[14]  Sarita V. Adve,et al.  Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[15]  Meeta Sharma Gupta,et al.  SDCTune: A model for predicting the SDC proneness of an application for configurable protection , 2014, 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES).

[16]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[17]  Olaf Spinczyk,et al.  Avoiding Pitfalls in Fault-Injection Based Comparison of Program Susceptibility to Soft Errors , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[18]  Muhammad Shafique,et al.  Leveraging variable function resilience for selective software reliability on unreliable hardware , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[19]  Joel S. Emer,et al.  The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.

[20]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[21]  Karthik Pattabiraman,et al.  Error detector placement for soft computation , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[22]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[23]  Stephen McCamant,et al.  The Daikon system for dynamic detection of likely invariants , 2007, Sci. Comput. Program..

[24]  John Paul Walters,et al.  A practical characterization of a NASA SpaceCube application through fault emulation and laser testing , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).