Impossibility of scalar clock-based communication-induced checkpointing protocols ensuring the RDT property

Abstract: Communication-induced checkpointing protocols are an interesting approach to determining, on-line, checkpoint and communication patterns that satisfy desirable properties such as domino-effect freedom. They add no control messages to the computation; instead, they may attach control information to computation messages. Among these protocols, scalar clock-based protocols are particularly attractive because they use a single integer as control information. An interesting property of checkpoint and communication patterns is Rollback-Dependency Trackability (RDT), which ensures that all dependencies between local checkpoints are trackable on the fly. It would therefore be desirable to design scalar clock-based communication-induced checkpointing protocols that provide the RDT property; whether such protocols exist was previously an open question. This paper shows that designing such protocols is impossible.
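To make the notion of a scalar clock-based protocol concrete, the following is a minimal sketch in the style of the classic index-based scheme of Briatico, Ciuffoletti, and Simoncini. It is not taken from this paper. The names `Message`, `Process`, `basic_checkpoint`, `send`, and `receive` are hypothetical and chosen only for illustration. The point is that the only control information carried by a computation message is one integer, and that a receiver may be forced to checkpoint before delivering a message.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # the only piggybacked control information: a single integer


@dataclass
class Process:
    pid: int
    sn: int = 0  # scalar clock (checkpoint sequence number)
    # Log of (sequence number, kind) pairs; a real system would save state here.
    checkpoints: list = field(default_factory=lambda: [(0, "initial")])

    def basic_checkpoint(self) -> None:
        """Checkpoint taken on the process's own initiative."""
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self, payload: str) -> Message:
        """Attach the current scalar clock to an outgoing computation message."""
        return Message(payload, self.sn)

    def receive(self, msg: Message) -> str:
        """Take a forced checkpoint before delivery if the sender is ahead."""
        if msg.sn > self.sn:
            self.sn = msg.sn
            self.checkpoints.append((self.sn, "forced"))
        return msg.payload


# Illustrative run with two processes (assumed scenario).
p, q = Process(0), Process(1)
p.basic_checkpoint()       # p's clock advances to 1
m = p.send("hello")        # the message carries sn = 1
q.receive(m)               # q is behind, so it takes a forced checkpoint
q.receive(p.send("again")) # clocks are equal now, so no checkpoint is forced
```

Rules of this kind are known to prevent the domino effect. The result stated in the abstract is that no protocol limited to piggybacking such a single integer can go further and ensure RDT.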
