A model for availability analysis of distributed software/hardware systems

Abstract: System availability is a major performance concern in the design and analysis of distributed systems. A typical class of distributed applications has a homogeneously distributed software/hardware structure; that is, identical copies of the distributed application software run on the same type of computer. This paper studies system availability for this type of system. Such a study is useful when determining the optimal testing time or the allocation of testing resources. We consider both the simple two-host case and the more general multi-host case. A Markov model is developed, and equations are derived to obtain the steady-state availability. Both software and hardware failures are considered, under the assumption that software faults are continually identified and removed upon failure. Although a specific software reliability model is used for illustration, the approach is general. Comparisons show that system availability changes in a way similar to that of single-host software/hardware systems. A sensitivity analysis is also presented, and the assumptions used in the paper are discussed.
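To make the idea of steady-state availability from a Markov model concrete, the following is a minimal sketch and not the model developed in the paper. It assumes a hardware-only birth-death chain in which the state is the number of failed hosts, each working host fails independently at a constant rate, each failed host is repaired independently at a constant rate, and the system counts as available while at least one host works. The function name, parameter names, and numeric rates are hypothetical, and software failures with fault removal are deliberately left out.

```python
def steady_state_availability(n_hosts, failure_rate, repair_rate):
    """Steady-state availability of an illustrative birth-death Markov chain.

    ASSUMPTIONS (not taken from the paper):
      * state k = number of failed hosts, k = 0..n_hosts
      * each working host fails independently at `failure_rate`
      * each failed host is repaired independently at `repair_rate`
      * the system is up while at least one host is working
    """
    # Unnormalised stationary weights from the detail-balance equations:
    # w[k+1] = w[k] * (rate k -> k+1) / (rate k+1 -> k)
    weights = [1.0]
    for k in range(n_hosts):
        fail_transition = (n_hosts - k) * failure_rate  # k -> k+1
        repair_transition = (k + 1) * repair_rate       # k+1 -> k
        weights.append(weights[-1] * fail_transition / repair_transition)

    total = sum(weights)
    probabilities = [w / total for w in weights]

    # The only "down" state is the one in which every host has failed.
    return 1.0 - probabilities[-1]


if __name__ == "__main__":
    # Hypothetical rates, chosen only for illustration.
    for n in (1, 2, 4):
        print(n, steady_state_availability(n, failure_rate=0.01, repair_rate=1.0))
```

Under these assumptions each host is independently unavailable with probability failure_rate / (failure_rate + repair_rate), so the result can be checked against the closed form 1 - (failure_rate / (failure_rate + repair_rate)) ** n_hosts. A model that also tracks software faults being removed over time would need a larger state space than this sketch uses.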
