Thesis for the Degree of Doctor of Philosophy

On Efficient Measurement of the Impact of Hardware Errors in Computer Systems

Technology and voltage scaling are making integrated circuits increasingly susceptible to failures caused by soft errors. Soft errors originate from temporary hardware faults that alter data and signals in digital circuits. They are predominantly caused by ionizing particles, electrical noise and wear-out effects, but may also occur as a result of marginal circuit designs and manufacturing process variations. Modern computers are equipped with a range of hardware- and software-based mechanisms for detecting and correcting soft errors, as well as other types of hardware errors. While these mechanisms can handle a variety of errors and error types, protecting a computer completely from the effects of soft errors is technically and economically infeasible. Hence, in applications where reliability and data integrity are of primary concern, it is desirable to assess and measure the system's ability to detect and correct soft errors. This thesis is devoted to the problem of measuring the hardware error sensitivity of computer systems. We define hardware error sensitivity as the probability that a hardware error results in an undetected erroneous output. Since the complexity of computer systems makes it extremely demanding to assess the effectiveness of error handling mechanisms analytically, error sensitivity and related measures, e.g., error coverage, are in practice determined experimentally by means of fault injection experiments. The error sensitivity of a computer system depends not only on the design of its error handling mechanisms, but also on the program executed by the computer. In addition, measurements of error sensitivity are affected by the experimental set-up, including how and where the errors are injected, and the assumptions about how soft errors are manifested, i.e., the error model. This thesis identifies and investigates six parameters, or sources of variation, that affect measurements of error sensitivity.
These parameters fall into two subgroups: those that deal with system characteristics, namely (i) the input processed by a program, (ii) the program's source code implementation, and (iii) the level of compiler optimization; and those that deal with the measurement setup, namely (iv) the number of bits that are targeted in each experiment, (v) the target location in which faults are injected, and (vi) the time of injection. To accurately measure the error sensitivity of a system, one needs to conduct several sets of fault injection experiments in which these sources of variation are varied. As these experiments are quite time-consuming, it is desirable to improve the efficiency of fault injection-based measurement of error sensitivity. To this end, the thesis proposes and evaluates different error space optimization and error space pruning techniques to reduce the time and effort needed to measure the error sensitivity.
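As an illustration of the definition above, the following minimal sketch estimates error sensitivity as the fraction of injected errors that produce an undetected erroneous output. It is a toy example, not the tooling or method used in the thesis: the `workload` function, its range check (standing in for an error detection mechanism), the 16-bit word width and the input data are all assumptions made for illustration. Errors are injected as single bit flips into the input data before the program runs, so the time of injection is not varied here.

```python
import random
from collections import Counter

WIDTH = 16  # assumed word width of each data item


def workload(data):
    """Hypothetical program under test: range-check inputs, then sum their low bytes."""
    for x in data:
        if x >= 1 << 12:  # assumed error detection mechanism (range check)
            raise ValueError("range check failed")
    return sum(x & 0xFF for x in data)


def inject(data, index, bit):
    """Return a copy of data with one bit flipped (single bit-flip error model)."""
    faulty = list(data)
    faulty[index] ^= 1 << bit
    return faulty


def classify(data, index, bit, golden):
    """Run one fault injection experiment and classify its outcome."""
    try:
        out = workload(inject(data, index, bit))
    except ValueError:
        return "detected"
    return "benign" if out == golden else "undetected_wrong_output"


def error_sensitivity(data, experiments):
    """Fraction of experiments that end in an undetected erroneous output."""
    golden = workload(data)  # fault-free reference output
    outcomes = Counter(classify(data, i, b, golden) for i, b in experiments)
    total = sum(outcomes.values())
    return outcomes["undetected_wrong_output"] / total, outcomes


data = [0x123, 0x0FF, 0x800, 0x001]  # illustrative input, all values below 2**12

# Exhaustive campaign: every (target location, bit) pair in the error space.
error_space = [(i, b) for i in range(len(data)) for b in range(WIDTH)]
exhaustive, counts = error_sensitivity(data, error_space)

# Sampled campaign: random experiments drawn from the same error space.
rng = random.Random(0)
sample = [rng.choice(error_space) for _ in range(2000)]
sampled, _ = error_sensitivity(data, sample)

print(f"exhaustive: {exhaustive:.3f}, sampled estimate: {sampled:.3f}")
```

In this toy setting the error space is small enough to enumerate, so the sampled estimate can be compared against the exhaustive value. For realistic programs the error space is far too large for that, which is why reducing the number of experiments needed matters.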
