Using logging and asynchronous checkpointing to implement recoverable distributed shared memory

Distributed shared memory provides a useful paradigm for developing distributed applications. As the number of processors in a system and the running time of distributed applications increase, so does the likelihood of processor failure. A method of recovering processes running in a distributed shared memory environment that minimizes both lost work and the cost of recovery is therefore desirable, so that long-running applications are not adversely affected by processor failure. This paper presents a technique for achieving recoverable distributed shared memory that uses asynchronous process checkpoints and logging of pages accessed via read operations on the shared address space. The scheme supports independent process recovery without forcing operational processes to roll back during recovery. The method is particularly useful in environments where taking process checkpoints is expensive.
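
The abstract gives no implementation details, so the following is only a toy sketch of the general idea of combining independent checkpoints with a log of pages read, not the paper's algorithm. It assumes a process whose only nondeterministic input is the contents of the shared pages it reads; the class `LoggedProcess`, its methods, and the dictionary standing in for the shared address space are all hypothetical names introduced for illustration.

```python
class LoggedProcess:
    """Toy process whose only nondeterministic input is the pages it reads."""

    def __init__(self, shared_pages):
        self.shared = shared_pages  # dict: page id -> contents (stand-in for the DSM)
        self.total = 0              # volatile local state
        self.pc = 0                 # volatile: number of reads performed so far
        self.log = []               # assumed stable: page contents read since the checkpoint
        self.ckpt = (0, 0)          # assumed stable: (total, pc) at the last checkpoint

    def read(self, page_id):
        value = self.shared[page_id]
        self.log.append(value)      # log the page as seen by this read
        self.total += value
        self.pc += 1

    def take_checkpoint(self):
        # Taken without coordinating with any other process; log entries
        # that precede the checkpoint are no longer needed for replay.
        self.ckpt = (self.total, self.pc)
        self.log.clear()

    def crash(self):
        # Volatile state is lost; the checkpoint and log survive.
        self.total, self.pc = None, None

    def recover(self):
        # Restore the checkpoint, then replay the logged reads instead of
        # re-reading shared memory, so no other process has to roll back.
        self.total, self.pc = self.ckpt
        for value in self.log:
            self.total += value
            self.pc += 1
```

In this sketch, a page overwritten by another process after the read does not affect recovery: replay uses the logged contents, so the recovering process returns to its pre-failure state on its own. The longer the interval between checkpoints, the longer the log to replay, which is the trade-off that matters when checkpoints are expensive.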
