A checkpoint protocol for an entry consistent shared memory system

Workstation clusters are becoming an interesting alternative to dedicated multiprocessors. In this environment, the probability of a failure during an application’s execution increases with the execution time and the number of workstations used. If no provision is made for handling failures, long-running applications are unlikely to terminate successfully. One solution to this problem is process checkpointing. This paper presents a checkpoint protocol for a multithreaded distributed shared memory system based on the entry consistency memory model. The protocol allows transparent recovery from single-node failures and, in some cases, from multiple-node failures. A simple mechanism determines whether the system can be brought to a consistent state after multiple machine crashes. The protocol keeps a distributed log of shared data accesses in the volatile memory of the processes, taking advantage of the independent failure characteristics of workstation clusters. Periodically, or whenever the log reaches a high-water mark, each process checkpoints its state independently of the others. The protocol needs no extra messages during failure-free operation, since all checkpoint control information is piggybacked on the memory coherence protocol messages.
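The checkpoint trigger described above (periodically, or when the log reaches a high-water mark) can be illustrated with a minimal sketch. It is not the paper's implementation: the class `CheckpointingProcess`, its methods, and the parameters `high_water_mark` and `period_s` are hypothetical names chosen for illustration, and clearing the log after a checkpoint is a simplifying assumption, since the abstract does not say when log entries are discarded.

```python
import time


class CheckpointingProcess:
    """Toy model of one process; every name here is hypothetical."""

    def __init__(self, high_water_mark, period_s, clock=time.monotonic):
        self.high_water_mark = high_water_mark  # log entries that force a checkpoint
        self.period_s = period_s                # maximum seconds between checkpoints
        self.clock = clock                      # injectable clock, eases testing
        self.log = []                           # volatile log of shared data accesses
        self.checkpoints_taken = 0
        self.last_checkpoint = clock()

    def record_access(self, entry):
        """Log one shared data access, then check both triggers."""
        self.log.append(entry)
        return self.maybe_checkpoint()

    def maybe_checkpoint(self):
        """Checkpoint if the log is full or the period has elapsed."""
        log_full = len(self.log) >= self.high_water_mark
        timer_expired = self.clock() - self.last_checkpoint >= self.period_s
        if log_full or timer_expired:
            self.take_checkpoint()
            return True
        return False

    def take_checkpoint(self):
        # Placeholder: a real system would save the process state to
        # stable storage here. No other process is contacted, which
        # mirrors the independent checkpointing the abstract describes.
        self.checkpoints_taken += 1
        self.last_checkpoint = self.clock()
        # Assumption of this sketch only: the log is emptied after a checkpoint.
        self.log.clear()
```

Each process would run this logic on its own, so no coordination messages are needed to decide when to checkpoint.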
