A Checkpointing-Recovery Scheme for Domino-Free Distributed Systems

In communication-induced checkpointing algorithms, cooperating processes take checkpoints at their own pace but must also take some forced checkpoints in order to guarantee domino-freeness. In this paper we present a checkpointing-recovery scheme that takes fewer forced checkpoints than previous solutions while piggybacking only three integers of control information on each message. This is achieved by using information about the history of a process, together with an equivalence relation between local checkpoints that we introduce in this paper. A simulation study that quantifies this reduction is also presented.
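To make the notions of piggybacked control information and forced checkpoints concrete, the following is a minimal sketch of the classic index-based rule used by communication-induced checkpointing protocols: each message carries the sender's checkpoint index, and a receiver whose own index is lower takes a forced checkpoint before delivering the message. This is not the scheme presented in this paper, whose three piggybacked integers and equivalence relation are not specified in the abstract. The names `Process`, `Message`, `sn`, `basic_checkpoint`, `send`, and `receive` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint index, piggybacked as control information


@dataclass
class Process:
    name: str
    sn: int = 0       # index of the latest local checkpoint
    basic: int = 0    # checkpoints taken at the process's own pace
    forced: int = 0   # checkpoints imposed by the protocol
    log: list = field(default_factory=list)

    def basic_checkpoint(self) -> None:
        """Take a checkpoint at the process's own pace."""
        self.sn += 1
        self.basic += 1
        self.log.append(("basic", self.sn))

    def send(self, payload: str) -> Message:
        """Attach the current checkpoint index to the outgoing message."""
        return Message(payload, self.sn)

    def receive(self, msg: Message) -> None:
        """Take a forced checkpoint first if the sender is ahead of us."""
        if msg.sn > self.sn:
            self.sn = msg.sn
            self.forced += 1
            self.log.append(("forced", self.sn))
        # The message would be delivered to the application here.


if __name__ == "__main__":
    p, q = Process("p"), Process("q")
    p.basic_checkpoint()
    q.receive(p.send("m1"))  # q is behind p, so it takes a forced checkpoint
    q.receive(p.send("m2"))  # indices now match, so no forced checkpoint
    print(q.log, q.forced)
```

Under this simple rule every message from a process with a higher index forces a checkpoint at the receiver. Schemes such as the one described above aim to skip some of these forced checkpoints by carrying slightly more information on each message.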
