Fault-tolerance using cache-coherent distributed shared memory systems

This paper describes new protocols that augment traditional cache-coherency mechanisms to implement fault tolerance based on recovery blocks and checkpointing. Concurrent processes complicate rollback recovery, since rolling back one process can force the processes it communicated with to roll back as well, potentially leading to a "domino effect" whereby a process is rolled back to its beginning. Several approaches have been proposed to limit the domino effect. One set of such techniques requires communicating processes to synchronize periodically in order to checkpoint a globally consistent state. These schemes can be implemented more naturally on distributed shared memory systems, using synchronization on shared memory. We have developed extensions to well-known cache-coherency methods (e.g., directory-based protocols) that implement checkpointing of consistent states.
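The idea of synchronizing before checkpointing can be illustrated with a small sketch. This is not the protocol developed in the paper: it is a toy model in which Python threads stand in for communicating processes, a dictionary stands in for a shared-memory region, and a barrier plays the role of synchronization on shared memory. All names (`shared`, `local_state`, `take_checkpoint`, `rollback`, `run`) are hypothetical and chosen only for illustration.

```python
import copy
import threading

NUM_WORKERS = 3
shared = {"counter": 0}          # stands in for a shared-memory region
lock = threading.Lock()          # guards updates to `shared`
local_state = [0] * NUM_WORKERS  # per-worker private state
checkpoint = {}                  # last globally consistent snapshot


def take_checkpoint():
    # Runs in exactly one thread once every worker has reached the barrier.
    # No worker is mid-update at that point, so the snapshot of shared and
    # private state is globally consistent.
    checkpoint["shared"] = copy.deepcopy(shared)
    checkpoint["local"] = list(local_state)


barrier = threading.Barrier(NUM_WORKERS, action=take_checkpoint)


def do_work(wid, steps):
    for _ in range(steps):
        with lock:
            shared["counter"] += 1
        local_state[wid] += 1


def worker(wid, steps_before, steps_after):
    do_work(wid, steps_before)
    barrier.wait()             # synchronize; the barrier action checkpoints
    do_work(wid, steps_after)  # work that is not yet checkpointed


def rollback():
    # Restore every worker and the shared region to the same checkpoint.
    # Because the checkpoint is consistent, no worker has to roll back
    # any further, which is what bounds the domino effect.
    shared.clear()
    shared.update(copy.deepcopy(checkpoint["shared"]))
    local_state[:] = checkpoint["local"]


def run(steps_before, steps_after):
    threads = [
        threading.Thread(target=worker, args=(w, steps_before, steps_after))
        for w in range(NUM_WORKERS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `run(4, 2)` and then `rollback()` discards the two post-checkpoint steps of every worker and returns both the shared counter and each worker's private state to the values recorded at the barrier. A real distributed shared memory system would need the coherence protocol itself to capture dirty cache lines at the checkpoint, which this sketch does not model.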
