Paper information - Lightweight logging for lazy release consistent distributed shared memory

This paper presents a new logging and recovery algorithm for lazy release consistent distributed shared memory (DSM). The new algorithm tolerates single node failures by maintaining a distributed log of data dependencies in the volatile memory of processes. The algorithm adds very little overhead to the memory consistency protocol: it sends no additional messages during failure-free periods; it adds only a minimal amount of data to one of the DSM protocol messages; it introduces no forced rollbacks of non-faulty processes; and it performs no communication-induced accesses to stable storage. Furthermore, the algorithm logs only a very small amount of data, because it uses the log of memory accesses already maintained by the memory consistency protocol. The algorithm was implemented in TreadMarks, a state-of-the-art DSM system. Experimental results show that the algorithm has near zero time overhead and very low space overhead during failure-free execution, thus refuting the common belief that logging overhead is necessarily high in recoverable DSM systems.
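The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it names: a log held in the volatile memory of other processes can survive the failure of a single node. It is a generic sender-side logging toy, not the paper's algorithm and not TreadMarks code. The names `Process`, `sent_log`, and `recover` are hypothetical, and messages stand in for whatever dependency data the real protocol records.

```python
from collections import defaultdict


class Process:
    """Toy process that logs, in memory only, what it has sent to each peer."""

    def __init__(self, pid):
        self.pid = pid
        self.received = []                 # (seq, sender pid, payload), in delivery order
        self.sent_log = defaultdict(list)  # peer pid -> [(seq, payload)]; volatile
        self.next_seq = 0                  # next delivery sequence number at this process

    def send(self, peer, payload):
        # The receiver assigns a delivery sequence number and the sender
        # records it next to the payload. Nothing is written to stable storage.
        seq = peer.deliver(self.pid, payload)
        self.sent_log[peer.pid].append((seq, payload))

    def deliver(self, sender_pid, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.received.append((seq, sender_pid, payload))
        return seq

    def crash(self):
        # A failure wipes all volatile state held by this process,
        # including its own log of what it sent to others.
        self.received = []
        self.sent_log = defaultdict(list)
        self.next_seq = 0


def recover(failed, survivors):
    """Rebuild the failed process's deliveries from the survivors' logs."""
    entries = []
    for p in survivors:
        for seq, payload in p.sent_log.get(failed.pid, []):
            entries.append((seq, p.pid, payload))
    # Replay in the original delivery order.
    for entry in sorted(entries):
        failed.received.append(entry)
    failed.next_seq = len(failed.received)
```

With three processes, if two of them send to the third and the third then crashes, calling `recover` with the two survivors restores its deliveries in the original order. The sketch only covers one failure at a time: the crashed process's own `sent_log` is lost, and a real scheme would have to regenerate it during replay.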
