COMA: An Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors

Because of their growing number of components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures, particularly when they are used for long-running computations. In this paper, we show that Cache Only Memory Architectures (COMAs) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMAs by exploiting their standard replication mechanisms and extending the coherence protocol to manage recovery data transparently. The proposed fault-tolerant COMA is evaluated through execution-driven simulations using some of the SPLASH applications. We show that, for the simulated architecture, the performance degradation caused by the fault-tolerance mechanisms ranges from 5% in the best case to 35% in the worst case. The standard memory behavior is only slightly perturbed. The results also show that the proposed scheme preserves the scalability of the architecture and that the memory overhead remains low for parallel applications that use mostly shared data.
