Guaranteed Mutually Consistent Checkpointing in Distributed Computations

In this paper, we explore the isomorphism between vector time and causality to characterize the consistency of a set of checkpoints in a distributed computation. We present a necessary and sufficient condition for determining whether a set of checkpoints can form a consistent global checkpoint, and we prove it using the isomorphism between vector time and causality. To the best of our knowledge, this is the first attempt to use the isomorphism for this purpose. The condition leads to a simple and straightforward algorithm for guaranteed mutually consistent global checkpointing. In our approach, a process can take a checkpoint whenever and wherever it wants, while other related processes may be asked to take an additional checkpoint to ensure mutual consistency. We also show how this condition and the resulting algorithm can be used to obtain the maximum and minimum consistent global checkpoints, another important paradigm for distributed applications.
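The abstract does not state the condition itself, so the following is only a minimal sketch of the standard vector-clock consistency test, not the paper's own formulation or algorithm. It assumes each checkpoint is a local event stamped with a vector timestamp, and treats a set of checkpoints (one per process) as consistent when no checkpoint causally precedes another. The names `happened_before` and `is_consistent` are hypothetical and chosen for illustration.

```python
from typing import Dict, List

VectorClock = List[int]


def happened_before(u: VectorClock, v: VectorClock) -> bool:
    """Return True if timestamp u causally precedes v.

    Under vector time, u -> v holds exactly when u <= v in every
    component and the two timestamps are not identical.
    """
    return all(a <= b for a, b in zip(u, v)) and u != v


def is_consistent(checkpoints: Dict[int, VectorClock]) -> bool:
    """Check whether one checkpoint per process forms a consistent set.

    `checkpoints` maps a process index to the vector timestamp of the
    checkpoint chosen for that process (an assumed representation).
    The set is rejected if any checkpoint happened before another.
    """
    for i, ts_i in checkpoints.items():
        for j, ts_j in checkpoints.items():
            if i != j and happened_before(ts_i, ts_j):
                return False
    return True


# Three pairwise-concurrent checkpoints: consistent.
concurrent_set = {0: [2, 0, 0], 1: [1, 3, 0], 2: [0, 0, 1]}

# P1's checkpoint reflects a message P0 sent after its own checkpoint,
# so P0's checkpoint happened before P1's: not consistent.
dependent_set = {0: [2, 0, 0], 1: [3, 2, 0], 2: [0, 0, 1]}

print(is_consistent(concurrent_set))
print(is_consistent(dependent_set))
```

In the second set, a process in the position of P0 is the kind of "related process" that would need an additional, later checkpoint before the set could become mutually consistent.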
