Measurement-based evaluation of operating system fault tolerance

The authors demonstrate a methodology for evaluating the fault-tolerance characteristics of operational software and illustrate it through case studies of three operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, software error characteristics are investigated by analyzing error distributions and correlation. Two levels of models are developed to analyze the error and recovery processes inside an operating system and the interactions among multiple copies of an operating system running in a distributed environment. Reward analysis is used to evaluate the loss of service due to software errors and the effect of fault-tolerant techniques implemented in the systems.
