A VP-accordant checkpointing protocol preventing useless checkpoints

A useless checkpoint corresponds to the occurrence of a checkpoint and communication pattern called a Z-cycle. A recent result shows that ensuring a computation has no Z-cycles is a particular application of a property called Virtual Precedence (VP), defined on an interval-based abstraction of a computation. We first propose a taxonomy of communication-induced checkpointing protocols based on the way they ensure the VP property. We then derive a sufficient condition for the absence of Z-cycles in a distributed computation. This condition defines a checkpoint and communication pattern, called a suspect Z-cycle, such that if no suspect Z-cycle exists in a distributed computation, then no Z-cycle exists. Finally, we present a communication-induced checkpointing protocol that avoids useless checkpoints by preventing the formation of suspect Z-cycles on the fly, and we discuss its performance relative to other protocols.
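To make the Z-cycle pattern concrete, the following minimal sketch checks offline whether a given checkpoint lies on a Z-cycle in a recorded trace. It is not the protocol described above, which works on the fly and targets suspect Z-cycles; it only illustrates the pattern using the standard zigzag-path definition. The names `Message` and `has_z_cycle`, and the encoding of a message by its send and receive checkpoint intervals, are assumptions made for this illustration, where interval x of a process is the span between its checkpoints x and x+1.

```python
from collections import deque
from typing import NamedTuple


class Message(NamedTuple):
    # Hypothetical trace record: interval x of a process is the span
    # between its local checkpoints number x and number x + 1.
    sender: int
    send_interval: int
    receiver: int
    recv_interval: int


def has_z_cycle(messages, process, index):
    """Return True if checkpoint `index` of `process` lies on a Z-cycle.

    A zigzag path is a chain of messages in which each message is sent by
    the receiver of the previous one, in the same or a later checkpoint
    interval than the one in which the previous message was received.
    The checkpoint is on a Z-cycle if such a chain starts after it and
    ends before it on the same process.
    """
    # Start from messages sent by `process` after the checkpoint.
    frontier = deque(
        m for m in messages
        if m.sender == process and m.send_interval >= index
    )
    seen = set(frontier)
    while frontier:
        m = frontier.popleft()
        # The chain closes if it is received by `process` before the checkpoint.
        if m.receiver == process and m.recv_interval < index:
            return True
        for nxt in messages:
            if (nxt not in seen
                    and nxt.sender == m.receiver
                    and nxt.send_interval >= m.recv_interval):
                seen.add(nxt)
                frontier.append(nxt)
    return False


# P1 sends m1 after its checkpoint 1; P2 receives m1 in interval 0, the same
# interval in which it had sent m2, which P1 received before checkpoint 1.
pattern = [Message(1, 1, 2, 0), Message(2, 0, 1, 0)]

# Same trace, except P2 takes a checkpoint before delivering m1.
broken = [Message(1, 1, 2, 1), Message(2, 0, 1, 0)]
```

On these traces, `has_z_cycle(pattern, 1, 1)` is true, so checkpoint 1 of P1 is useless, while `has_z_cycle(broken, 1, 1)` is false. The extra checkpoint on P2 in the second trace is the kind of forced checkpoint a communication-induced protocol takes to break such a pattern.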
