Tolerating node failures in cache only memory architectures

Cache-only memory architectures (COMAs) are an interesting class of large-scale shared-memory multiprocessors. They extend the concepts of cache memories and shared virtual memory by using the local memories of the nodes as large caches for a single shared address space. Because of their large number of components, these architectures are particularly susceptible to hardware failures, so fault tolerance mechanisms have to be introduced to ensure high availability. We propose an implementation of backward error recovery in a COMA that minimizes performance degradation and requires few hardware modifications. This implementation uses the features of a COMA to implement a stable storage abstraction using the standard memories of the architecture. Recovery data are replicated and mixed with current data in node memories, and both are managed transparently by an extended coherence protocol.
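The general idea of backward error recovery over replicated recovery data can be illustrated with a small sketch. This is not the paper's coherence protocol: the `Machine` and `Node` classes, the two-copy replication factor, and the round-robin placement are all assumptions made for illustration, and the sketch ignores atomicity of recovery-point establishment. It only shows how keeping each recovery copy on two distinct nodes, alongside current data, lets the system roll back after a single node failure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node memory holding both kinds of data side by side.
    current: dict = field(default_factory=dict)   # modifiable working copies
    recovery: dict = field(default_factory=dict)  # copies of the last recovery point
    alive: bool = True


class Machine:
    """Toy model: recovery data replicated on two live nodes (assumed factor)."""

    def __init__(self, n_nodes):
        self.nodes = [Node() for _ in range(n_nodes)]

    def _live(self):
        return [n for n in self.nodes if n.alive]

    def write(self, node_id, addr, value):
        # Invalidate other working copies, then write in the local memory.
        for n in self._live():
            n.current.pop(addr, None)
        self.nodes[node_id].current[addr] = value

    def read(self, addr):
        # Prefer a working copy; otherwise the item is unmodified since
        # the last recovery point, so a recovery copy is up to date.
        for n in self._live():
            if addr in n.current:
                return n.current[addr]
        for n in self._live():
            if addr in n.recovery:
                return n.recovery[addr]
        raise KeyError(addr)

    def _install(self, state):
        # Replace all data with recovery copies, two per item when possible.
        live = self._live()
        for n in live:
            n.current.clear()
            n.recovery.clear()
        for i, addr in enumerate(sorted(state)):
            for k in range(min(2, len(live))):
                live[(i + k) % len(live)].recovery[addr] = state[addr]

    def establish_recovery_point(self):
        # New recovery point = old recovery data overlaid with current data.
        state = {}
        for n in self._live():
            state.update(n.recovery)
        for n in self._live():
            state.update(n.current)
        self._install(state)

    def fail(self, node_id):
        # A failed node loses its whole memory content.
        node = self.nodes[node_id]
        node.alive = False
        node.current.clear()
        node.recovery.clear()

    def rollback(self):
        # Discard current data and rebuild from surviving recovery copies.
        state = {}
        for n in self._live():
            state.update(n.recovery)
        self._install(state)
```

In this sketch, a write made after `establish_recovery_point()` is discarded by `rollback()`, and every item keeps two recovery copies on the surviving nodes after one failure.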
