Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers

An important step in the development of dependable systems is the validation of their fault tolerance properties. Fault injection has been widely used for this purpose; however, with the rapid increase in processor complexity, traditional techniques are becoming increasingly difficult to apply. This paper presents a new software-implemented fault injection and monitoring environment, called Xception, which is targeted at modern, complex processors. Xception uses the advanced debugging and performance monitoring features present in most modern processors to inject realistic faults by software and to monitor, in detail, the activation of the faults and their impact on the behavior of the target system. Faults are injected with minimal interference with the target application: the application is not modified, no software traps are inserted, and it does not have to run in a special trace mode (it executes at full speed). Xception provides a comprehensive set of fault triggers, including spatial and temporal fault triggers and triggers related to the manipulation of data in memory. Faults injected by Xception can affect any process running on the target system (including the kernel), and faults can be injected into applications for which the source code is not available. Experimental results are presented to demonstrate the accuracy and potential of Xception in evaluating the dependability properties of today's complex computer systems.
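
To make the notions of spatial and temporal fault triggers concrete, the following is a minimal toy sketch in Python. It is not Xception's implementation, which relies on the processor's hardware debug and performance monitoring features. It only simulates a tiny register machine in software, and every name in it (`State`, `spatial_trigger`, `temporal_trigger`, `run`, and the small instruction helpers) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class State:
    pc: int           # index of the instruction about to execute
    cycle: int        # number of instructions executed so far
    regs: List[int]   # 32-bit register values


Instruction = Callable[[State], None]
Trigger = Callable[[State], bool]


def flip_bit(value: int, bit: int) -> int:
    """Emulate a transient single-bit fault in a 32-bit word."""
    return (value ^ (1 << bit)) & 0xFFFFFFFF


def spatial_trigger(address: int) -> Trigger:
    """Fire when execution reaches a given instruction address."""
    return lambda s: s.pc == address


def temporal_trigger(cycle: int) -> Trigger:
    """Fire once a given number of instructions has been executed."""
    return lambda s: s.cycle >= cycle


def load(reg: int, value: int) -> Instruction:
    def op(s: State) -> None:
        s.regs[reg] = value & 0xFFFFFFFF
    return op


def add(dst: int, a: int, b: int) -> Instruction:
    def op(s: State) -> None:
        s.regs[dst] = (s.regs[a] + s.regs[b]) & 0xFFFFFFFF
    return op


def run(program: List[Instruction],
        trigger: Optional[Trigger] = None,
        reg: int = 0,
        bit: int = 0) -> Tuple[List[int], bool]:
    """Run the program; inject one bit flip the first time the trigger fires."""
    state = State(pc=0, cycle=0, regs=[0, 0, 0, 0])
    injected = False
    while state.pc < len(program):
        if trigger is not None and not injected and trigger(state):
            state.regs[reg] = flip_bit(state.regs[reg], bit)
            injected = True
        program[state.pc](state)
        state.pc += 1
        state.cycle += 1
    return state.regs, injected


# Hypothetical three-instruction workload: r2 = r0 + r1.
PROGRAM = [load(0, 5), load(1, 7), add(2, 0, 1)]
```

Running `run(PROGRAM)` gives the fault-free reference result, with 12 in register 2. With `spatial_trigger(2)` and a flip of bit 0 in register 0, the fault is injected just before the addition and corrupts its result to 11. With `temporal_trigger(1)` and a flip of bit 3 in register 1, the fault is injected, but the following load overwrites the corrupted register, so the final state matches the reference run. This is why a fault injection environment needs to monitor fault activation separately from fault injection.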
