Rethinking error injection for effective resilience

Soft errors, caused by radiation, have become a major challenge in today's computer systems and networking equipment, making it imperative that systems be designed to be resilient to errors. Error injection is a powerful approach to evaluate system resilience, and current practice is to inject errors in architectural registers of processors, program variables of applications, or storage elements in the hardware model. This paper, using answers to frequently asked questions, discusses the need for rethinking conventional approaches to error injection, showing data from recent research and our simulation results. Approaches to improving current error injections are also suggested.

[1]  Todd M. Austin,et al.  CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework , 2008, 2008 IEEE International Conference on Computer Design.

[2]  Joel S. Emer,et al.  The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.

[3]  Prabhakar Kudva,et al.  Soft-error resilience of the IBM POWER6 processor , 2008, IBM J. Res. Dev..

[4]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[5]  R. Allmon,et al.  Soft Error Susceptibilities of 22 nm Tri-Gate Devices , 2012, IEEE Transactions on Nuclear Science.

[6]  Mahmut T. Kandemir,et al.  Object duplication for improving reliability , 2006, Asia and South Pacific Conference on Design Automation, 2006..

[7]  Chen Chang,et al.  BEE3: Revitalizing Computer Architecture Research , 2009 .

[8]  W. H. Robinson,et al.  Fault Simulation and Emulation Tools to Augment Radiation-Hardness Assurance Testing , 2013, IEEE Transactions on Nuclear Science.

[9]  Dhiraj K. Pradhan,et al.  Fault Injection: A Method for Validating Computer-System Dependability , 1995, Computer.

[10]  Johan Karlsson,et al.  Evaluation of error detection schemes using fault injection by heavy-ion radiation , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[11]  Jacob A. Abraham,et al.  Quantitative evaluation of soft error injection techniques for robust system design , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[12]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[13]  Ravishankar K. Iyer,et al.  Error Behavior Comparison of Multiple Computing Systems: A Case Study Using Linux on Pentium, Solaris on SPARC, and AIX on POWER , 2008, 2008 14th IEEE Pacific Rim International Symposium on Dependable Computing.

[14]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[15]  Dong Li,et al.  Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[16]  Régis Leveugle,et al.  Statistical fault injection: Quantified error and confidence , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[17]  Joseph A. Catania Soft Errors in Electronic Memory – A White Paper , 2022 .

[18]  James L. Walsh,et al.  IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..

[19]  Ravishankar K. Iyer,et al.  Measurement-based analysis of fault and error sensitivities of dynamic memory , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[20]  Ravishankar K. Iyer,et al.  Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation , 1999, IEEE Trans. Software Eng..

[21]  Ravishankar K. Iyer,et al.  Automated Derivation of Application-Specific Error Detectors Using Dynamic Analysis , 2011, IEEE Transactions on Dependable and Secure Computing.

[22]  Yun Zhang,et al.  DAFT: Decoupled Acyclic Fault Tolerance , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[23]  Shubhendu S. Mukherjee,et al.  Perturbation-based Fault Screening , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.