Analysis of a composite performance reliability measure for fault-tolerant systems

Today's concomitant needs for higher computing power and reliability have increased the relevance of multiple-processor fault-tolerant systems. Multiple functional units improve the raw performance (throughput, response time, etc.) of the system, and, as units fail, the system may continue to function, albeit with degraded performance. Such systems, like other fault-tolerant systems, are not adequately characterized by separate performance and reliability measures. This paper analyzes a composite measure of the performance and reliability of a fault-tolerant system observed over a finite mission time. A Markov chain model represents the system state space, and transient analysis yields closed-form solutions for the density and moments of the composite measure. Only failures that cannot be repaired before the end of the mission are modeled. The time spent in a specific system configuration is assumed to be long enough to permit the use of a hierarchical model and of static measures to quantify the performance of the system in individual configurations. For a multiple-processor system, where performance measures are usually associated with and aggregated over many jobs, this is tantamount to assuming that the time to process a job is much smaller than the time between failures. An extension of the results to general acyclic Markov chain models is included.
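
As a rough illustration of the kind of quantity such a composite measure captures, the following sketch computes only the expected work accumulated over a mission, not the density or higher moments that the paper derives in closed form. Everything specific in it is an assumption made for the example: the function name, the pure-failure chain in which state i has i failed processors, the reward rate equal to the number of working processors, and the numerical values. The transient state probabilities are integrated numerically with a fixed-step Runge-Kutta scheme, which is a stand-in technique and not the paper's solution method.

```python
import math


def expected_accumulated_reward(fail_rates, rewards, mission_time, steps=2000):
    """Expected reward accumulated over [0, mission_time] for an acyclic,
    nonrepairable chain 0 -> 1 -> ... -> m (hypothetical helper).

    fail_rates[i] is the transition rate from state i to state i + 1;
    rewards[i] is the static performance level assigned to state i.
    The last state is absorbing, so len(fail_rates) == len(rewards) - 1.
    """
    n_states = len(rewards)
    if len(fail_rates) != n_states - 1:
        raise ValueError("need one failure rate per non-absorbing state")
    out_rates = list(fail_rates) + [0.0]  # absorbing final state

    def deriv(state):
        # state = [p_0, ..., p_m, y], where y is the accumulated expected reward
        p = state[:-1]
        dp = []
        for i in range(n_states):
            inflow = out_rates[i - 1] * p[i - 1] if i > 0 else 0.0
            dp.append(inflow - out_rates[i] * p[i])
        dy = sum(r * pi for r, pi in zip(rewards, p))
        return dp + [dy]

    # Start in state 0 (no failures) with no accumulated reward.
    state = [1.0] + [0.0] * (n_states - 1) + [0.0]
    h = mission_time / steps
    for _ in range(steps):  # classical fourth-order Runge-Kutta steps
        k1 = deriv(state)
        k2 = deriv([s + 0.5 * h * k for s, k in zip(state, k1)])
        k3 = deriv([s + 0.5 * h * k for s, k in zip(state, k2)])
        k4 = deriv([s + h * k for s, k in zip(state, k3)])
        state = [
            s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        ]
    return state[-1]


# Assumed example: 3 processors, each failing at rate 0.01 per hour,
# observed over a 100-hour mission, reward = number of working processors.
n, lam, T = 3, 0.01, 100.0
fail_rates = [(n - i) * lam for i in range(n)]  # 3*lam, 2*lam, lam
rewards = [float(n - i) for i in range(n + 1)]  # 3, 2, 1, 0
numeric = expected_accumulated_reward(fail_rates, rewards, T)
closed_form = n * (1.0 - math.exp(-lam * T)) / lam
```

For this assumed chain, independent exponential lifetimes give an expected n·exp(-λt) working processors at time t, so the mean accumulated reward has the simple closed form n(1 - exp(-λT))/λ. The sketch computes both values so that the numerical integration can be checked against that expression.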
