Faults, symptoms, and software fault tolerance in the Tandem GUARDIAN90 operating system

The authors present a measurement-based study of software failures and recovery in the Tandem GUARDIAN90 operating system, based on a collection of memory dump analyses of field software failures. They identify the effects of software faults on the processor state and trace how those effects propagate to other areas of the system. They also evaluate the role of defensive programming techniques and the software fault tolerance provided by the process pair mechanism implemented in the Tandem system. Results show that the Tandem system tolerates nearly 82% of reported field software faults, demonstrating its effectiveness against software faults. Consistency checks made by the operating system detect 52% of software problems and prevent any error propagation in 31% of them. Results also show that 72% of reported field software failures are recurrences of known software faults and that 70% of the recurrence groups have identical characteristics.
