Evaluation of Software-Implemented Fault-Tolerance (SIFT) Approach in Gracefully Degradable Multi-Computer Systems

This paper presents an analytical method for evaluating the reliability improvement of a multi-computer system of any size that uses Software-Implemented Fault-Tolerance (SIFT). The method is based on the equivalent failure rate Gamma, the single-node failure rate lambda, the number of nodes in the system N, the repair rate mu, the fault coverage factor c, the reconfiguration rate delta, and the percentages of blocking faults b1 and b2. The impact of these parameters on the reliability improvement is evaluated for a gracefully degradable multi-computer system using the proposed analytical technique, which is based on Markov chains. To validate the approach, we used a SIFT method that implements error detection at the node level, combined with a fast reconfiguration algorithm for avoiding faulty nodes. The proposed method is applicable to any multi-computer system topology. The evaluation work presented in this paper combines analytical and experimental approaches, with the analytical part relying on Markov chains. The SIFT method has been successfully implemented on a multi-computer system, the nCube. The time overhead (reconfiguration and recomputation time) incurred by an injected fault and the fault coverage factor c are evaluated experimentally by means of a parallel version of the Software Object-Oriented Fault-Injection Tool (nSOFIT). The implemented SIFT approach can be used for real-time applications in which time constraints must be met despite failures in the gracefully degradable multi-computer system.
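As a rough illustration of how such a Markov-chain evaluation can be set up, the sketch below builds a small continuous-time Markov chain for a gracefully degradable system and derives an equivalent failure rate from it. It is not the model of the paper: the state space (number of working nodes), the `min_nodes` threshold, the single shared repair rate, and the definition of Gamma as the reciprocal of the mean time to failure are all assumptions made here for illustration, and the reconfiguration rate delta and the blocking-fault percentages b1 and b2 are left out entirely.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= f * m[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mttf(n_nodes, min_nodes, lam, mu, c):
    """Mean time to system failure, starting with all n_nodes working.

    Assumed (hypothetical) model:
      state k = number of working nodes, min_nodes <= k <= n_nodes
      k -> k-1   at rate k*lam*c      (covered fault, system degrades)
      k -> FAIL  at rate k*lam*(1-c)  (uncovered fault)
      k -> k+1   at rate mu           (repair, only when k < n_nodes)
      dropping below min_nodes also counts as system failure
    """
    states = list(range(min_nodes, n_nodes + 1))
    idx = {k: i for i, k in enumerate(states)}
    size = len(states)
    # Expected-time equations: out_rate(k)*t[k] - sum(rate(k->j)*t[j]) = 1
    a = [[0.0] * size for _ in range(size)]
    b = [1.0] * size
    for k in states:
        i = idx[k]
        fail = k * lam
        repair = mu if k < n_nodes else 0.0
        a[i][i] = fail + repair
        if k - 1 >= min_nodes:
            a[i][idx[k - 1]] -= fail * c
        if repair:
            a[i][idx[k + 1]] -= repair
    return solve(a, b)[idx[n_nodes]]


def equivalent_failure_rate(n_nodes, min_nodes, lam, mu, c):
    """Gamma, taken here (as an assumption) to be 1 / MTTF."""
    return 1.0 / mttf(n_nodes, min_nodes, lam, mu, c)


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    for c in (0.90, 0.99, 1.00):
        g = equivalent_failure_rate(8, 4, lam=1e-3, mu=1e-1, c=c)
        print(f"c = {c:.2f}  Gamma = {g:.3e}")
```

Sweeping one parameter at a time, as the loop at the bottom does for c, is one simple way to see how each parameter affects the equivalent failure rate in a model of this kind.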
