Experimental study of software dependability

There is a growing demand for highly dependable software systems. The dependability of a software system is influenced by both software faults and hardware faults, so improving it requires an understanding of the characteristics of these faults and of how they affect software systems. In this study, a distributed fault injection and monitoring environment (DEFINE) has been developed. It consists of a target system, a fault injector, a software monitor, a workload generator, a controller, and several analysis utilities. DEFINE can inject both software and hardware faults, trace fault propagation within software systems and among machines, monitor whether and when faults are activated, and control timing accurately. The fault models used are extracted from the results of field error data analyses and fault simulations. Fault injection experiments on the UNIX kernel (SunOS 4.1.2) and the Sun Network File System are conducted to study fault impact and to investigate fault propagation. Three kinds of fault injection are conducted: uniform, biased, and path-based. Based on the experimental results, fault propagation models have been developed for both hardware and software faults, and transient Markov reward analysis has been performed to evaluate the loss of performance after a fault is injected. Experimental results show that the majority of no-impact faults are latent. Memory faults and software faults usually have a very long latency, while bus faults and CPU faults tend to crash the system immediately. About half of the detected errors are data faults, which are detected when the system tries to access a memory location it has no privilege to access. Only about 8% of faults propagate to other UNIX subsystems. Fault propagation from servers to clients occurs more frequently than from clients to servers. The fault impact depends on the workload. Transient Markov reward analysis shows that the performance losses incurred by bus faults and CPU faults are much higher than those incurred by software and memory faults. Among software faults, the impact of pointer faults is higher than that of non-pointer faults.
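
As a rough illustration of what a transient Markov reward analysis computes, the sketch below evaluates the expected performance level at time t after a fault is injected. It is not the model used in the study: the three states (fault latent, error active, crashed), the transition rates, and the reward values are all hypothetical, and uniformization is simply one standard way to obtain transient state probabilities of a continuous-time Markov chain.

```python
import math

def transient_probs(Q, p0, t, tol=1e-12, max_terms=100000):
    """State probabilities at time t for a CTMC with generator Q,
    computed by uniformization (a Poisson-weighted sum of DTMC steps)."""
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n)) or 1.0  # uniformization rate
    # Embedded discrete-time transition matrix P = I + Q / lam
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    weight = math.exp(-lam * t)        # Poisson probability of 0 jumps
    acc = weight                       # accumulated Poisson mass
    v = list(p0)                       # distribution after k jumps
    result = [weight * x for x in v]
    for k in range(1, max_terms):
        if 1.0 - acc <= tol:
            break
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k          # Poisson probability of k jumps
        acc += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

def expected_reward(Q, p0, rewards, t):
    """Expected reward rate at time t: sum over states of r_i * pi_i(t)."""
    return sum(p * r for p, r in zip(transient_probs(Q, p0, t), rewards))

# Hypothetical model (NOT taken from the study):
#   state 0: fault injected but latent, full performance
#   state 1: error active, degraded performance
#   state 2: system crashed, no useful work (absorbing here)
Q = [[-0.5, 0.4, 0.1],
     [0.0, -0.3, 0.3],
     [0.0, 0.0, 0.0]]
REWARDS = [1.0, 0.6, 0.0]   # assumed relative performance per state
P0 = [1.0, 0.0, 0.0]        # the fault has just been injected

if __name__ == "__main__":
    for t in (0.0, 1.0, 5.0, 10.0):
        loss = 1.0 - expected_reward(Q, P0, REWARDS, t)
        print(f"t={t:5.1f}  expected performance loss={loss:.4f}")
```

Under this kind of model, fault types that tend to crash the system immediately would correspond to a large rate into the zero-reward state, and long-latency faults to a small rate out of the latent state, which is one way to read the difference in performance loss reported above.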
