Paper information - A VP-accordant checkpointing protocol preventing useless checkpoints

A VP-accordant checkpointing protocol preventing useless checkpoints

A useless checkpoint corresponds to the occurrence of a checkpoint and communication pattern called a Z-cycle. A recent result shows that ensuring a computation without Z-cycles is a particular application of a property, namely Virtual Precedence (VP), defined on an interval-based abstraction of a computation. We first propose a taxonomy of communication-induced checkpointing protocols based on the way they ensure the VP property. We then derive a sufficient condition ensuring that a distributed computation has no Z-cycles. This condition defines a checkpoint and communication pattern, called a suspect Z-cycle, such that if no suspect Z-cycle exists in a distributed computation then no Z-cycle exists. Finally, we present a communication-induced checkpointing protocol that avoids useless checkpoints by preventing, on the fly, the formation of suspect Z-cycles, and we discuss its performance with respect to other protocols.
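To make the notion of a Z-cycle concrete, here is a minimal offline sketch in Python. It is not the protocol proposed in the paper, which prevents suspect Z-cycles on the fly. It only checks, after the fact, whether a given checkpoint lies on a Z-cycle, using the usual zigzag-path definition from the consistent-snapshot literature. The names `Message` and `in_z_cycle`, and the convention that interval k on a process covers the events after its checkpoint k and before its checkpoint k+1, are assumptions made for this illustration.

```python
from collections import namedtuple

# Hypothetical record for one message: the sending process and the checkpoint
# interval in which it was sent, and the same for its receipt.
# Assumed convention: interval k = events after checkpoint k, before checkpoint k+1.
Message = namedtuple("Message", "send_proc send_interval recv_proc recv_interval")


def in_z_cycle(messages, proc, ckpt):
    """Return True if checkpoint number `ckpt` of process `proc` is on a Z-cycle."""
    # A Z-path must start with a message sent by `proc` after the checkpoint.
    frontier = [m for m in messages
                if m.send_proc == proc and m.send_interval >= ckpt]
    seen = set(frontier)
    while frontier:
        m = frontier.pop()
        # The path closes into a cycle if a message is received by `proc`
        # before the same checkpoint.
        if m.recv_proc == proc and m.recv_interval < ckpt:
            return True
        # The next message must be sent by the receiver of `m` in the same or
        # a later interval (possibly before `m` was actually received).
        for nxt in messages:
            if (nxt not in seen and nxt.send_proc == m.recv_proc
                    and nxt.send_interval >= m.recv_interval):
                seen.add(nxt)
                frontier.append(nxt)
    return False


# Pj sends m1, then receives m2, with no checkpoint in between;
# Pi receives m1, takes checkpoint 1, then sends m2.
cyclic = [Message("Pj", 0, "Pi", 0), Message("Pi", 1, "Pj", 0)]
# Same pattern, but Pj takes a checkpoint between sending m1 and receiving m2.
broken = [Message("Pj", 0, "Pi", 0), Message("Pi", 1, "Pj", 1)]

print(in_z_cycle(cyclic, "Pi", 1))
print(in_z_cycle(broken, "Pi", 1))
```

In the first pattern, checkpoint 1 of Pi is useless because it cannot belong to any consistent global checkpoint. In the second pattern, the extra checkpoint on Pj breaks the cycle. A communication-induced protocol takes such forced checkpoints at run time.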

Bruno Ciciani | Roberto Baldoni | Francesco Quaglia

[1] David L. Russell, et al. State Restoration in Systems of Communicating Processes, 1980, IEEE Transactions on Software Engineering.

[2] Achour Mostéfaoui, et al. Virtual Precedence in Asynchronous Systems: Concept and Applications, 1997, WDAG.

[3] Hon Fung Li, et al. Optimal Checkpointing and Local Recording for Domino-Free Rollback Recovery, 1987, Inf. Process. Lett.

[4] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.

[5] Achour Mostéfaoui, et al. A communication-induced checkpointing protocol that ensures rollback-dependency trackability, 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.

[6] D. Manivannan, et al. A low-overhead recovery technique using quasi-synchronous checkpointing, 1996, Proceedings of 16th International Conference on Distributed Computing Systems.

[7] Yin-Min Wang, et al. Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints, 1997, IEEE Trans. Computers.

[8] Colin J. Fidge, et al. Logical time in distributed computing systems, 1991, Computer.

[9] Brian Randell, et al. System structure for software fault tolerance, 1975, IEEE Transactions on Software Engineering.

[10] Augusto Ciuffoletti, et al. A Distributed Domino-Effect free recovery Algorithm, 1984, Symposium on Reliability in Distributed Software and Database Systems.

[11] Roberto Baldoni, et al. An index-based checkpointing algorithm for autonomous distributed systems, 1997, Proceedings of SRDS'97: 16th IEEE Symposium on Reliable Distributed Systems.

[12] Brian Randell. System structure for software fault tolerance, 1975.

[13] Jian Xu, et al. Necessary and Sufficient Conditions for Consistent Global Snapshots, 1995, IEEE Trans. Parallel Distributed Syst.

[14] Leslie Lamport, et al. Time, clocks, and the ordering of events in a distributed system, 1978, CACM.

[15] Achour Mostéfaoui, et al. Preventing useless checkpoints in distributed computations, 1997, Proceedings of SRDS'97: 16th IEEE Symposium on Reliable Distributed Systems.