Availability Monitor for a Software Based System

Computer and communication systems are ubiquitous and are used extensively in safety-critical, life-critical, and finance-critical applications. Because outages are so costly, users do not tolerate downtime. Most vendors offer high-availability applications and tout features such as hardware redundancy, software replication, automated detection, failover, and hot swap. However, quantitative validation of high availability is rarely provided. This paper presents a method for monitoring and displaying, in real time, the empirically observed availability of a system. Apart from presenting the idea and the associated statistical methods, we also sketch its implementation in a project (a middleware appliance) at the WebSphere Institute of IBM RTP.
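As a rough illustration of what "empirically observed availability" can mean, the sketch below computes a point estimate from logged up and down intervals. It is not the paper's implementation: the function name `observed_availability`, the input format (lists of interval durations in hours), and the choice of the simple ratio uptime / (uptime + downtime) are assumptions made for this example only.

```python
from statistics import mean


def observed_availability(up_hours, down_hours):
    """Point estimate of availability from observed interval durations.

    up_hours and down_hours are hypothetical inputs: the lengths of the
    observed operating and outage intervals, in the same time unit.
    """
    total_up = sum(up_hours)
    total_down = sum(down_hours)
    total = total_up + total_down
    if total <= 0:
        raise ValueError("no observation time recorded")
    return total_up / total


# Illustrative data only (not taken from the paper).
up = [99.0, 101.0]
down = [1.0, 1.0]

availability = observed_availability(up, down)

# With the same number of up and down intervals, the ratio above equals
# MTTF / (MTTF + MTTR) computed from the sample means.
mttf = mean(up)
mttr = mean(down)
steady_state = mttf / (mttf + mttr)
```

A monitor built along these lines would recompute the estimate as each new failure or repair event is logged; a confidence interval around the point estimate would need the statistical treatment the paper refers to and is not attempted here.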
