Techniques for Modeling the Reliability of Fault-Tolerant Systems With the Markov State-Space Approach

This paper presents a step-by-step tutorial on the methods and tools used for the reliability analysis of fault-tolerant systems. The approach is the Markov (or semi-Markov) state-space method. The paper is intended for design engineers with a basic understanding of computer architecture and fault tolerance but little knowledge of reliability modeling. It emphasizes the representation of architectural features in mathematical models. It does not present details of the mathematical solution of complex reliability models; instead, it describes the use of several recently developed computer programs (SURE, ASSIST, STEM, and PAWS) that automate the generation and solution of these models.
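To make the state-space idea concrete, the following is a minimal sketch of a pure Markov model for a triple modular redundant (TMR) system without repair. It is not taken from the paper and does not reproduce how SURE, ASSIST, STEM, or PAWS work. The three-state layout, the per-processor failure rate `lam`, the function names, and the choice of a fixed-step Runge-Kutta integrator are all illustrative assumptions.

```python
import math

def tmr_failure_probability(lam, t, steps=10000):
    """Probability that a TMR system has failed by time t.

    Assumed model (hypothetical, for illustration only):
      state 0: three good processors
      state 1: two good processors, one fault masked by voting
      state 2: system failure (absorbing)
    Transitions: 0 -> 1 at rate 3*lam, 1 -> 2 at rate 2*lam.
    The state probabilities are integrated with classical 4th-order
    Runge-Kutta using a fixed step size of t/steps.
    """
    def deriv(p):
        p0, p1, _ = p
        return (-3 * lam * p0,
                3 * lam * p0 - 2 * lam * p1,
                2 * lam * p1)

    p = (1.0, 0.0, 0.0)  # start with all three processors working
    h = t / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(tuple(x + 0.5 * h * k for x, k in zip(p, k1)))
        k3 = deriv(tuple(x + 0.5 * h * k for x, k in zip(p, k2)))
        k4 = deriv(tuple(x + h * k for x, k in zip(p, k3)))
        p = tuple(x + h / 6 * (a + 2 * b + 2 * c + d)
                  for x, a, b, c, d in zip(p, k1, k2, k3, k4))
    return p[2]

def tmr_failure_closed_form(lam, t):
    """Closed-form failure probability for the same assumed model:
    1 - (3*exp(-2*lam*t) - 2*exp(-3*lam*t))."""
    return 1.0 - (3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t))

if __name__ == "__main__":
    lam, mission_time = 1e-4, 10.0  # assumed rate per hour and mission hours
    print(tmr_failure_probability(lam, mission_time))
    print(tmr_failure_closed_form(lam, mission_time))
```

For a model this small the closed form is available, so the numerical result can be checked against it. Models of realistic architectures have far more states and include fast recovery transitions, which is the situation the automated tools named in the abstract are meant to handle.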
