Independent global snapshots in large distributed systems

Distributed systems depend on consistent global snapshots for process recovery and garbage collection. We provide exact conditions for an arbitrary checkpoint, based on independent dependency tracking within clusters of nodes. The method permits nodes within a cluster to compute dependency information independently, using only locally available information. Existing models of global snapshot computation provide necessary and sufficient conditions, but they require expensive global computations. With the proposed computations, a node can identify existing global checkpoints. A node can also compute the conditions for taking a checkpoint, or the conditions under which a collection of checkpoints can belong to a global snapshot.
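To make the notion of a collection of checkpoints belonging to a global snapshot concrete, the following is a minimal sketch of the textbook consistency test based on dependency vectors. It is not the cluster-based method summarized above. The function name `is_consistent`, the node labels, and the dictionary layout are assumptions made for illustration, and the sketch assumes one checkpoint per node, each stored with a vector recording the latest checkpoint index of every node it causally depends on.

```python
from typing import Dict

# Hypothetical layout: cut[node][other] is the highest checkpoint index of
# `other` that `node`'s chosen checkpoint causally depends on, and
# cut[node][node] is the index of the chosen checkpoint itself.
DependencyVector = Dict[str, int]


def is_consistent(cut: Dict[str, DependencyVector]) -> bool:
    """Return True if no chosen checkpoint depends on a state of another
    node that is later than that node's own chosen checkpoint."""
    for i, vec_i in cut.items():
        for j, vec_j in cut.items():
            if i == j:
                continue
            # j has recorded a dependency on i that i's checkpoint has not
            # reached yet, so the pair cannot lie on one consistent snapshot.
            if vec_j.get(i, 0) > vec_i[i]:
                return False
    return True


# Each node's view of the other is no later than the other's own checkpoint.
good_cut = {"A": {"A": 2, "B": 1}, "B": {"A": 1, "B": 3}}

# B depends on checkpoint interval 2 of A, but A contributes checkpoint 1.
bad_cut = {"A": {"A": 1, "B": 0}, "B": {"A": 2, "B": 1}}

print(is_consistent(good_cut))
print(is_consistent(bad_cut))
```

This test needs the vectors of all participating nodes in one place, which is the kind of global computation the abstract describes as expensive.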
