On distributed object checkpointing and recovery

Recovery by checkpointing on distributed shared memory systems is investigated in this paper. The notion of consistent global states on a sequentially consistent shared memory system is defined. We investigate how consistent checkpoints can be obtained in these systems. In addition, a novel lazy checkpointing approach is proposed. It allows a controlled degree of concurrency and, at the same time, limits the amount of rollback propagation during recovery. Correctness requirements for efficient checkpointing are explored first and algorithms satisfying the requirements are developed subsequently. Several interesting properties of checkpointing on distributed shared memory systems are discovered. In particular, we show that for low levels of laziness, one can achieve better concurrency with more stable storage.
