Modeling the Propagation of Soft Errors in Programs

Due to the continuous growth of transistors in processors, the cloud computing paradigm and increasingly complex environment, soft errors, which are a category of typical transient errors, have become an urgent challenge in ground-level systems. To handle these errors, making clear the error propagation is the key step. As programs become large, we cannot employ traditional fault injection campaigns to monitor every possible error. This paper proposes a method to study and model the propagation of soft errors within a program. Based on dynamic instructions traced in an error-free program execution, the ACE analysis classifies soft errors in architectural registers into benign ones and non-benign ones. The benign errors are considered to be derated in the propagation. Then we build a crash model and an improved DDG to analyze the propagation of each non-benign error and to predict its consequence (crash or silent data corruption). If the error is considered to cause a crash, the crash latency and the propagation path are also predicted. The method can be used to predict outcomes of programs under soft errors as well as occurrence of correct outputs, silent data corruptions or crashes. Extensive fault-injection experiments are provided to validate the proposed method from multiple perspectives.

[1]  James L. Walsh,et al.  IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..

[2]  N. Seifert,et al.  Robust system design with built-in soft-error resilience , 2005, Computer.

[3]  David R. Kaeli,et al.  Eliminating microarchitectural dependency from Architectural Vulnerability , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[4]  Lloyd W. Massengill,et al.  Basic mechanisms and modeling of single-event upset in digital microelectronics , 2003 .

[5]  Jean Arlat,et al.  Analysis of the effects of real and injected software faults: Linux as a case study , 2002, 2002 Pacific Rim International Symposium on Dependable Computing, 2002. Proceedings..

[6]  Ravishankar K. Iyer,et al.  Application-based metrics for strategic placement of detectors , 2005, 11th Pacific Rim International Symposium on Dependable Computing (PRDC'05).

[7]  Carla E. Brodley,et al.  Heat stroke: power-density-based denial of service in SMT , 2005, 11th International Symposium on High-Performance Computer Architecture.

[8]  Kevin Skadron,et al.  Temperature-aware microarchitecture , 2003, ISCA '03.

[9]  Ravishankar K. Iyer,et al.  FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior under Faults , 1993, IEEE Trans. Software Eng..

[10]  Xin Xu,et al.  Understanding soft error propagation using Efficient vulnerability-driven fault injection , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[11]  Arijit Biswas,et al.  Computing architectural vulnerability factors for address-based structures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[12]  Bogdan Korel,et al.  Dynamic program slicing in understanding of program execution , 1997, Proceedings Fifth International Workshop on Program Comprehension. IWPC'97.

[13]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[14]  Ravishankar K. Iyer,et al.  Characterization of linux kernel behavior under errors , 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings..

[15]  K. Soumyanath,et al.  Scaling trends of cosmic ray induced soft errors in static latches beyond 0.18 /spl mu/ , 2001, 2001 Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.01CH37185).

[16]  R. Baumann Soft errors in advanced semiconductor devices-part I: the three radiation sources , 2001 .

[17]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[18]  Todd M. Austin,et al.  Dynamic dependency analysis of ordinary programs , 1992, ISCA '92.

[19]  Takashi Nanya,et al.  Zapmem: A Framework for Testing the Effect of Memory Corruption Errors on Operating System Kernel Reliability , 2009, 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing.

[20]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[21]  Ravishankar K. Iyer,et al.  CloudVal: A framework for validation of virtualization environment in cloud infrastructure , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[22]  Karthik Pattabiraman,et al.  Modeling the Propagation of Intermittent Hardware Faults in Programs , 2010, 2010 IEEE 16th Pacific Rim International Symposium on Dependable Computing.

[23]  Joel S. Emer,et al.  The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.

[24]  H. Howie Huang,et al.  On Soft Error Reliability of Virtualization Infrastructure , 2016, IEEE Transactions on Computers.

[25]  Sanjay J. Patel,et al.  Examining ACE analysis reliability estimates using fault-injection , 2007, ISCA '07.

[26]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[27]  Ravishankar K. Iyer,et al.  Error Behavior Comparison of Multiple Computing Systems: A Case Study Using Linux on Pentium, Solaris on SPARC, and AIX on POWER , 2008, 2008 14th IEEE Pacific Rim International Symposium on Dependable Computing.

[28]  Farshad Firouzi,et al.  Instruction reliability analysis for embedded processors , 2010, 13th IEEE Symposium on Design and Diagnostics of Electronic Circuits and Systems.