Finding a recovery line in uncoordinated checkpointing

In distributed systems that use uncoordinated checkpointing, each process should retain several generations of local checkpoints to improve dependability, because a global checkpoint, which is a set of local checkpoints, is not always consistent. In this paper, we present an algorithm for finding a recovery line in which a given checkpoint is the earliest, under uncoordinated checkpointing schemes. We also present numerical examples of the probability that a recovery line exists, calculated with the proposed algorithm.
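To make the notion of a consistent global checkpoint concrete, the following is a minimal Python sketch of the classic rollback-propagation idea. It is not the algorithm proposed in this paper. The message encoding, the function names `is_consistent` and `find_recovery_line`, and the `oldest` parameter (the oldest checkpoint generation each process still retains) are all assumptions made for illustration. A global checkpoint is treated as inconsistent when it contains an orphan message, that is, a message whose receipt is recorded while its sending is not.

```python
# Illustrative sketch only: classic rollback propagation, not the paper's algorithm.
# Checkpoints of each process are numbered 0, 1, 2, ... (0 is the initial state).
# A message is a tuple (sender, sent_after, receiver, recv_before):
#   sent_after  = index of the sender's last checkpoint taken before the send
#   recv_before = index of the receiver's first checkpoint taken after the receipt


def is_consistent(line, messages):
    """Return True if the global checkpoint `line` has no orphan message."""
    return not any(
        line[s] <= sent_after and line[r] >= recv_before
        for s, sent_after, r, recv_before in messages
    )


def find_recovery_line(latest, messages, oldest=None):
    """Roll receivers back until no orphan message remains.

    `latest` maps each process to its most recent checkpoint index.
    `oldest` (hypothetical parameter) maps each process to the oldest
    checkpoint it still retains; if the line falls below it, return None.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for s, sent_after, r, recv_before in messages:
            if line[s] <= sent_after and line[r] >= recv_before:
                # The receipt is recorded but the send is not: undo the receipt.
                line[r] = recv_before - 1
                changed = True
    if oldest is not None and any(line[p] < oldest[p] for p in line):
        return None
    return line


# Hypothetical two-process history.
messages = [("P", 2, "Q", 2), ("Q", 1, "P", 2)]
latest = {"P": 2, "Q": 2}
print(is_consistent(latest, messages))
print(find_recovery_line(latest, messages))
print(find_recovery_line(latest, messages, oldest={"P": 2, "Q": 1}))
```

In this example the latest checkpoints are inconsistent, and rolling back one orphan message exposes another, so both processes end up one generation back. If process P had retained only its latest generation, no recovery line would exist among the retained checkpoints, which illustrates why keeping several generations improves dependability.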
