Using multi-stage and stratified sampling for inferring fault-coverage probabilities

Development of fault-tolerant computing systems requires accurate reliability modeling. Analytic, simulation, and hybrid models are commonly used to obtain reliability measures. These measures are functions of component failure rates and fault-coverage probabilities. Coverage describes the system's fault and error detection, isolation, and recovery capabilities, and it can be derived by physical or simulated fault injection. This paper addresses the problem of conducting fault injection experiments and statistically inferring the coverage from the information gathered in those experiments. We perform statistical experiments in a multi-dimensional space of events, so that all major factors which influence the coverage (fault locations, timing characteristics of the fault, and the workload) are accounted for. Multi-stage, stratified, and combined multi-stage and stratified sampling are used to derive the coverage. Equations for the mean, variance, and confidence interval of the coverage are provided. The statistical error produced by injected faults which do not induce errors in the tested system (also known as the nonresponse problem) is considered. A program which emulates a typical fault environment was developed, and four hypothetical systems are analyzed.
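As a rough illustration of what a stratified coverage estimate looks like, the sketch below uses the standard textbook stratified-sampling estimator for a proportion, with a normal-approximation confidence interval. It is not taken from the paper: the function name `stratified_coverage`, the stratum weights, and the injection counts are all hypothetical, the finite-population correction is omitted, and the nonresponse problem is ignored (every injected fault is assumed to produce an observable outcome).

```python
import math


def stratified_coverage(strata, z=1.96):
    """Estimate coverage from stratified fault-injection results.

    strata: list of (weight, injected, covered) tuples, one per stratum.
        weight   - assumed probability that a fault falls in this stratum
        injected - number of faults injected in this stratum (at least 2)
        covered  - number of those faults that were handled correctly
    z: normal quantile for the interval (1.96 gives roughly 95%).

    Returns (mean, variance, (lower, upper)).
    """
    total_weight = sum(w for w, _, _ in strata)
    if not math.isclose(total_weight, 1.0, abs_tol=1e-9):
        raise ValueError("stratum weights must sum to 1")

    mean = 0.0
    variance = 0.0
    for weight, injected, covered in strata:
        if injected < 2 or not 0 <= covered <= injected:
            raise ValueError("invalid counts for a stratum")
        p = covered / injected  # per-stratum coverage estimate
        mean += weight * p
        # Unbiased variance estimate of p, weighted by the squared stratum weight.
        variance += weight ** 2 * p * (1.0 - p) / (injected - 1)

    half_width = z * math.sqrt(variance)
    interval = (max(0.0, mean - half_width), min(1.0, mean + half_width))
    return mean, variance, interval


# Hypothetical example: two equally likely strata, 100 injections each.
mean, variance, (lower, upper) = stratified_coverage(
    [(0.5, 100, 90), (0.5, 100, 80)]
)
print(f"coverage = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

The normal approximation degrades when the coverage is very close to 1, which is common in fault-tolerant systems, so this form is only a starting point for intuition.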
