Dynamic fault tolerance in DCMA: a dynamically configurable multicomputer architecture

This paper introduces a new architecture for a fault-tolerant computer system that connects high-end PCs or workstations through a high-speed network. To achieve platform independence, the coupling is based on the widely used PCI bus. In contrast to commercially available fault-tolerant systems, we strongly emphasize mechanisms for tolerating transient and intermittent faults. To keep hardware costs low, the system is built from off-the-shelf computers, and the extensions added to them are kept as small as possible. To reduce operational costs, the system can be adapted dynamically to different fault-tolerance demands on a program-by-program basis. The operating system performs this adaptation transparently to the application software. We use a commercially available real-time operating system with a POSIX-compliant UNIX interface. The supported levels of fault tolerance range from a non-redundant system of stand-alone computers, through a master/checker configuration, to a TMR (triple modular redundancy) system. The high-performance network also allows the system to operate as a parallel multicomputer.
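The three redundancy levels named above differ in what they can do with a disagreement between replicas: a stand-alone computer cannot detect one, a master/checker pair can detect but not mask it, and a TMR system can mask a single faulty result by majority vote. The following Python sketch illustrates only that distinction. It is not the DCMA implementation; the names `Mode` and `resolve` are hypothetical, and it assumes that replica results are hashable values that can be compared for equality.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    """Hypothetical per-program redundancy levels; the value is the replica count."""
    SIMPLEX = 1          # non-redundant, stand-alone computer
    MASTER_CHECKER = 2   # two replicas, results compared
    TMR = 3              # three replicas, majority vote


def resolve(mode, outputs):
    """Return (value, fault_detected) for one set of replica results.

    value is None when a fault was detected but could not be masked.
    """
    if len(outputs) != mode.value:
        raise ValueError("number of outputs must match the redundancy level")

    if mode is Mode.SIMPLEX:
        # A single result cannot be cross-checked.
        return outputs[0], False

    if mode is Mode.MASTER_CHECKER:
        master, checker = outputs
        if master != checker:
            # Disagreement is detected, but there is no way to tell who is right.
            return None, True
        return master, False

    # TMR: take the most common result and count how many replicas produced it.
    value, votes = Counter(outputs).most_common(1)[0]
    if votes == 3:
        return value, False
    if votes == 2:
        # One replica disagrees; the majority masks the fault.
        return value, True
    # All three replicas disagree; no majority exists.
    return None, True
```

For example, `resolve(Mode.TMR, [7, 7, 9])` yields the majority value 7 and reports a fault, whereas `resolve(Mode.MASTER_CHECKER, [7, 9])` can only report the fault. Selecting the mode per program, as the abstract describes, would then amount to choosing how many replicas the operating system runs for that program.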
