An efficient logging and recovery scheme for lazy release consistent distributed shared memory systems

Abstract: Checkpointing and logging are widely used techniques for providing fault tolerance in distributed systems. However, logging imposes too much processing overhead to be a practical solution. In this paper, we propose a low-overhead logging scheme for distributed shared memory systems based on the lazy release consistency memory model. Unlike previous schemes, in which logging is performed whenever a process accesses a new data item, the proposed scheme performs stable logging only when a lock grant creates an actual dependency between processes, which significantly reduces the logging frequency. In addition, instead of writing the accessed data items to the stable log, a process stably logs only some access information and keeps the accessed data items in a volatile log. During recovery from a failure, the correct versions of the accessed data items can be traced efficiently using the logged access information. As a result, the amount of logged information is also reduced.
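The split between stable and volatile logging described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's protocol: the `Process` class, its fields, the `(lock, granter, interval)` record layout, and the `replay` helper are all hypothetical names chosen for illustration. It also assumes that an "actual dependency" means a lock grant from a different process that carries an interval the acquirer has not yet seen.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical model of one DSM process (illustrative names only)."""
    pid: int
    interval: int = 0
    # Assumed to survive a crash: holds access information only.
    stable_log: list = field(default_factory=list)
    # Assumed to be lost on a crash: holds the data items themselves.
    volatile_log: dict = field(default_factory=dict)
    # Latest interval already received from each other process.
    seen: dict = field(default_factory=dict)

    def release(self, lock, written):
        """Close an interval and keep the written items in volatile memory."""
        self.interval += 1
        self.volatile_log[(lock, self.interval)] = dict(written)
        return self.interval

    def acquire(self, lock, granter_pid, granter_interval):
        """Append to the stable log only if the grant adds a new dependency."""
        if granter_pid == self.pid:
            return False  # re-acquiring our own lock: no inter-process dependency
        if self.seen.get(granter_pid, 0) >= granter_interval:
            return False  # nothing new from this granter: skip stable logging
        self.seen[granter_pid] = granter_interval
        self.stable_log.append((lock, granter_pid, granter_interval))
        return True


def replay(stable_log, processes):
    """Trace each logged access back to the data version it refers to."""
    return [processes[pid].volatile_log[(lock, interval)]
            for lock, pid, interval in stable_log]


if __name__ == "__main__":
    p0, p1 = Process(0), Process(1)
    i = p0.release("L", {"x": 1})
    p1.acquire("L", p0.pid, i)   # new dependency: stably logged
    p1.acquire("L", p0.pid, i)   # same interval again: not logged
    print(p1.stable_log, replay(p1.stable_log, {0: p0, 1: p1}))
```

In this sketch the stable log grows only when a dependency is created, and each record is a small tuple rather than a copy of the data, which mirrors the two savings the abstract claims. What happens when the process holding the volatile copy also fails is outside the scope of the sketch.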
