System Availability Analysis Considering Hardware/Software Failure Severities

Model-based analysis is a well-established approach to assessing how various factors influence system availability in the context of system structure. Prevalent availability models in the literature treat all failures as equivalent in their consequences for system services; in other words, all failures are assumed to have the same severity. In practice, failures are typically classified into multiple severity levels: failures at the highest severity level cause a complete loss of service, while failures at lower levels allow the system to continue operating in a degraded mode. It is therefore necessary to consider the influence of failure severities on system availability. In this paper, we present a Markov model that considers the failure severities of the system's components in conjunction with the system's structure. The model also incorporates the repair of the components. Based on the model, we derive a closed-form expression that relates system availability to the failure and repair parameters of the components. The failure parameters in the model are estimated from data collected during acceptance testing of a satellite system. Because adequate data are not available to estimate the repair parameters, the closed-form expression is instead used to assess the sensitivity of system availability to the repair parameters.
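To make the idea concrete, the following is a minimal sketch of a severity-aware availability calculation. It is not the paper's model. It assumes a single component with three states (up, degraded, down), exponential failure and repair times, and no direct transition from degraded to down. All names (`steady_state`, `availability`, `lam`, `p_severe`, `mu_degraded`, `mu_down`) and all numeric values are hypothetical and chosen only for illustration; none come from the satellite data.

```python
def steady_state(lam, p_severe, mu_degraded, mu_down):
    """Steady-state probabilities (up, degraded, down) of an assumed 3-state CTMC.

    Assumed transitions (illustrative, not taken from the paper):
      up -> down      at rate lam * p_severe        (highest-severity failure)
      up -> degraded  at rate lam * (1 - p_severe)  (lower-severity failure)
      degraded -> up  at rate mu_degraded           (repair)
      down -> up      at rate mu_down               (repair)
    """
    # Balance equations give pi_degraded = a * pi_up and pi_down = b * pi_up.
    a = lam * (1.0 - p_severe) / mu_degraded
    b = lam * p_severe / mu_down
    # Normalisation: pi_up + pi_degraded + pi_down = 1.
    pi_up = 1.0 / (1.0 + a + b)
    return pi_up, a * pi_up, b * pi_up


def availability(lam, p_severe, mu_degraded, mu_down):
    """Probability that service is not completely lost (up or degraded)."""
    pi_up, pi_degraded, _ = steady_state(lam, p_severe, mu_degraded, mu_down)
    return pi_up + pi_degraded


if __name__ == "__main__":
    # Sensitivity sweep over the repair rate for severe failures,
    # holding the (hypothetical) failure parameters fixed.
    for mu_down in (0.1, 0.5, 1.0, 2.0):
        print(mu_down, availability(0.01, 0.2, 1.0, mu_down))
```

Under these assumptions the closed form is A = (1 + a) / (1 + a + b), with a and b as defined in the code. Sweeping a repair rate while holding the failure parameters fixed mirrors, in a very simplified way, the kind of sensitivity analysis described above.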
