The Design and Implementation of a Customizable Fault Tolerance Framework

While there have been significant advances in fault tolerance research, the effort has focused on the design of individual fault-tolerant systems or methodologies. Recently, some research has been initiated to develop techniques that can provide a spectrum of fault tolerance capabilities. In this paper, we present the design of a fault tolerance framework that can support a wide range of applications with various fault tolerance requirements, various criticality levels, and various system models. The framework is designed to be parameterizable so that the user can configure it to obtain the desired features. Also, the framework is designed to be an off-the-shelf component such that application programs can be integrated within it easily to obtain the fault-tolerant version of the application system. A specialized N-modular redundancy (SNMR) scheme has been developed to serve as the primary approach for achieving efficient and cost-effective fault tolerance for the framework. In most cases, the SNMR scheme yields better performance and lower cost in providing fault tolerance as compared with conventional NMR schemes. It also enhances the scalability and customizability of the general replication method. This paper discusses the main issues in the design and implementation of the SNMR framework, including the major concept of the SNMR framework, various SNMR algorithms that tolerate various types of faults, an object-oriented overall system design and the interface protocol class hierarchy. The interface protocol class hierarchy provides a nice paradigm for the implementation of customizable, highly reusable, and easily extensible SNMR framework.

[1]  Jaynarayan H. Lala,et al.  Hardware and software fault tolerance: a unified architectural approach , 1988, [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[2]  I-Ling Yen Specialized N-modular redundant processors in large-scale distributed systems , 1996, Proceedings 15th Symposium on Reliable Distributed Systems.

[3]  Leslie Lamport,et al.  The Byzantine Generals Problem , 1982, TOPL.

[4]  Dave E. Eckhardt,et al.  A theoretical investigation of generalized voters for redundant systems , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[5]  Jaynarayan H. Lala,et al.  Fault tolerant parallel processor architecture overview , 1988, [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[6]  Chris J. Walter,et al.  The MAFT Architecture for Distributed Fault Tolerance , 1988, IEEE Trans. Computers.

[7]  Hermann Kopetz,et al.  Tolerating transient faults in MARS , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[8]  J. Goldberg,et al.  SIFT: Design and analysis of a fault-tolerant computer for aircraft control , 1978, Proceedings of the IEEE.

[9]  Günter Grünsteidl,et al.  TTP - A Protocol for Fault-Tolerant Real-Time Systems , 1994, Computer.

[10]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[11]  Richard E. Harper,et al.  Use of a functional programming model in fault tolerant parallel processing , 1989, [1989] The Nineteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[12]  Henrique Madeira,et al.  Experimental evaluation of the fail-silent behavior in computers without error masking , 1994, Proceedings of IEEE 24th International Symposium on Fault- Tolerant Computing.

[13]  Prathima Agrawal,et al.  Fault Tolerance in Multiprocessor Systems without Dedicated Redundancy , 1988, IEEE Trans. Computers.

[14]  D. Powell,et al.  The Delta-4 Approach to Dependability in Open Distributed Computing Systems , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ' Highlights from Twenty-Five Years'..

[15]  Silvano Maffeis Prianha: A CORBA Tool For High Availability , 1997, Computer.

[16]  Jiri Gaisler Concurrent error-detection and modular fault-tolerance in a 32-bit processing core for embedded space flight applications , 1994, Proceedings of IEEE 24th International Symposium on Fault- Tolerant Computing.