Locks and barriers in checkpointing and recovery

Tracking dependencies between communicating tasks is central to backward error recovery in parallel applications. The traditional dependency tracking model for message-passing systems can be extended to shared-memory applications by tracking dependencies between shared memory and task-private state. This paper analyzes the issues that locks and barriers raise in parallel applications, so that tasks can be checkpointed at any time, even while they hold a lock or wait on a lock or barrier. In particular, we attempt to extend earlier dependency tracking mechanisms to locks and barriers. We address both coordinated and uncoordinated checkpointing schemes.
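
To make the idea of dependency tracking across locks and barriers concrete, here is a minimal sketch. It is not the mechanism proposed in the paper; it only illustrates the general technique under simplifying assumptions. The `Tracker` class and all of its method names are hypothetical. The sketch assumes that a lock acquire depends on the checkpoint interval of the most recent release by another task, that a barrier makes every participant depend on every other, and that a recovery line is found by propagating rollbacks until no recorded dependency is violated.

```python
class Tracker:
    """Hypothetical direct-dependency tracker for locks and barriers."""

    def __init__(self, tasks):
        # Interval i is the execution that follows checkpoint i
        # (checkpoint 0 is the initial state of the task).
        self.interval = {t: 0 for t in tasks}
        self.deps = set()        # (src, src_interval, dst, dst_interval)
        self.last_release = {}   # lock -> (task, interval) of last release

    def checkpoint(self, task):
        self.interval[task] += 1

    def release(self, lock, task):
        self.last_release[lock] = (task, self.interval[task])

    def acquire(self, lock, task):
        # Assumption: the acquirer depends on the last releaser's interval.
        prev = self.last_release.get(lock)
        if prev is not None and prev[0] != task:
            self.deps.add((prev[0], prev[1], task, self.interval[task]))

    def barrier(self, tasks):
        # Assumption: passing a barrier creates mutual dependencies.
        for a in tasks:
            for b in tasks:
                if a != b:
                    self.deps.add((a, self.interval[a], b, self.interval[b]))

    def recovery_line(self, failed):
        # line[t] is the checkpoint t restarts from; interval + 1 is a
        # virtual checkpoint meaning "keep the live state, no rollback".
        line = {t: i + 1 for t, i in self.interval.items()}
        line[failed] = self.interval[failed]
        changed = True
        while changed:
            changed = False
            for src, s_iv, dst, d_iv in self.deps:
                # If the source interval is undone, the dependent interval
                # must be undone as well.
                if s_iv >= line[src] and d_iv < line[dst]:
                    line[dst] = d_iv
                    changed = True
        return line


tracker = Tracker(["A", "B"])
tracker.checkpoint("A")          # A enters interval 1
tracker.release("L", "A")        # released in A's interval 1
tracker.checkpoint("B")          # B enters interval 1
tracker.acquire("L", "B")        # B's interval 1 depends on A's interval 1
tracker.checkpoint("B")          # B enters interval 2
print(tracker.recovery_line("A"))
print(tracker.recovery_line("B"))
```

In this scenario a failure of A forces B back to its checkpoint 1, because B acquired the lock from an interval of A that is undone. A failure of B leaves A's live state untouched, since nothing A did depends on B. A real scheme would also have to restore the state of the lock itself and handle tasks checkpointed while blocked, which are the kinds of issues the paper sets out to analyze.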
