FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior under Faults

The authors present a fault injection and monitoring environment (FINE) as a tool to study fault propagation in the UNIX kernel. FINE injects hardware-induced software errors and software faults into the UNIX kernel and traces the execution flow and key variables of the kernel. FINE consists of a fault injector, a software monitor, a workload generator, a controller, and several analysis utilities. FINE is applied in experiments on SunOS 4.1.2 to investigate fault propagation and to evaluate the impact of various types of faults. Fault propagation models are built for both hardware and software faults. Transient Markov reward analysis is performed to evaluate the loss of performance due to an injected fault. Experimental results show that memory and software faults usually have a very long latency, while bus and CPU faults tend to crash the system immediately. About half of the detected errors are data faults, which are detected when the system tries to access an unauthorized memory location. Only about 8% of faults propagate to other UNIX subsystems. Markov reward analysis shows that the performance loss incurred by bus faults and CPU faults is much higher than that incurred by software and memory faults. Among software faults, the impact of pointer faults is higher than that of nonpointer faults.
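As a rough illustration of what a transient Markov reward analysis computes, the sketch below accumulates expected reward over a finite horizon for a small chain and compares a fault-free start with a start in an error state. It is not the model from the paper: the three states, the transition probabilities, the reward rates, and the discrete-time formulation are all assumptions chosen only to make the idea concrete.

```python
def transient_reward(P, reward, start, steps):
    """Expected reward accumulated over `steps` time steps of a
    discrete-time Markov chain that starts in state `start`."""
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0          # state distribution at time 0
    total = 0.0
    for _ in range(steps):
        # reward earned during this step, weighted by the current distribution
        total += sum(p * r for p, r in zip(dist, reward))
        # advance the distribution by one step: dist <- dist * P
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total

# Hypothetical states (assumption): 0 = normal, 1 = latent error, 2 = crashed
P = [
    [1.0, 0.0, 0.0],   # normal operation stays normal
    [0.1, 0.7, 0.2],   # a latent error may vanish, persist, or crash the system
    [0.0, 0.0, 1.0],   # a crash is absorbing within the observation window
]
reward = [1.0, 0.5, 0.0]   # hypothetical relative performance in each state

steps = 50
fault_free = transient_reward(P, reward, start=0, steps=steps)
faulty = transient_reward(P, reward, start=1, steps=steps)
loss = fault_free - faulty     # performance lost because of the injected fault
print(f"expected loss over {steps} steps: {loss:.2f}")
```

In this toy setting, a fault that crashes the system immediately corresponds to starting in state 2 and loses the whole reward over the window, while a fault that lingers as a latent error loses only part of it.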
