Fault Tolerance in Distributed Shared Memory Multiprocessors

Massively parallel systems pose a new challenge for fault tolerance. Designers of such systems cannot assume that every part of the system will remain fault-free. As the number and complexity of components grow significantly, the chance of single or multiple failures is no longer negligible. Redundancy, reconfigurability, and diagnosis techniques must therefore be incorporated at the design stage rather than added on afterwards. In this paper we discuss the fault tolerance techniques developed for MEMSY, a massively parallel architecture. In principle, these techniques can be readily transferred to other distributed shared memory multiprocessors.
