An introduction to snapshot algorithms in distributed computing

Recording global states of distributed executions on the fly is an important paradigm when one is interested in analysing, testing, or verifying properties of these executions. Since Chandy and Lamport's (1985) seminal paper on this topic, this problem has been called the snapshot problem. Unfortunately, the lack of both a globally shared memory and a global clock in a distributed system, together with the fact that transfer delays in these systems are finite but unpredictable, makes this problem non-trivial. This paper first discusses the issues that have to be addressed to compute distributed snapshots in a consistent way. It then presents several algorithms that determine such snapshots on the fly for several types of networks, classified by the properties of their communication channels: FIFO, non-FIFO, and causal delivery.
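To make the FIFO case concrete, the following is a minimal simulation sketch of a marker-based snapshot in the style of Chandy and Lamport (1985). It is illustrative only and is not taken from the paper: the names `Process`, `record`, `deliver`, `run`, and `snapshot_total`, the money-transfer workload, and the random scheduler are all assumptions. Each process records its local state once, sends a marker on every outgoing channel, and records the messages that arrive on an incoming channel until that channel's marker is received.

```python
import random
from collections import deque

MARKER = "MARKER"  # control message separating pre- and post-snapshot traffic

class Process:
    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance
        self.peers = peers      # ids of all other processes
        self.recorded = None    # recorded local state (None until recording starts)
        self.chan_state = {}    # sender id -> amounts recorded as in transit
        self.open = set()       # incoming channels still being recorded

def record(p, channels):
    """Record p's local state and send a marker on every outgoing channel."""
    p.recorded = p.balance
    p.open = set(p.peers)
    p.chan_state = {q: [] for q in p.peers}
    for q in p.peers:
        channels[(p.pid, q)].append(MARKER)

def deliver(p, sender, msg, channels):
    """Deliver one message from the FIFO channel (sender -> p)."""
    if msg == MARKER:
        if p.recorded is None:
            record(p, channels)  # first marker seen: record local state now
        p.open.discard(sender)   # stop recording this incoming channel
    else:
        p.balance += msg
        if p.recorded is not None and sender in p.open:
            p.chan_state[sender].append(msg)  # sent before, received after the cut

def run(n=3, steps=200, seed=0):
    rng = random.Random(seed)
    procs = [Process(i, 100, [j for j in range(n) if j != i]) for i in range(n)]
    channels = {(i, j): deque() for i in range(n) for j in range(n) if i != j}
    for step in range(steps):
        if step == 20:
            record(procs[0], channels)  # process 0 initiates the snapshot
        if rng.random() < 0.5:
            # Application event: transfer a random amount to a random peer.
            src = rng.choice(procs)
            if src.balance > 0:
                amount = rng.randint(1, src.balance)
                dst = rng.choice(src.peers)
                src.balance -= amount
                channels[(src.pid, dst)].append(amount)
        else:
            # Network event: deliver the head of a random non-empty channel.
            nonempty = [k for k, c in channels.items() if c]
            if nonempty:
                s, d = rng.choice(nonempty)
                deliver(procs[d], s, channels[(s, d)].popleft(), channels)
    # Drain the remaining messages so every marker is eventually delivered.
    while any(channels.values()):
        for (s, d), c in channels.items():
            if c:
                deliver(procs[d], s, c.popleft(), channels)
    return procs

def snapshot_total(procs):
    """Sum of the recorded local states plus recorded in-transit amounts."""
    local = sum(p.recorded for p in procs)
    in_transit = sum(sum(v) for p in procs for v in p.chan_state.values())
    return local + in_transit
```

In this sketch the total amount of money is an invariant of the computation, so a consistent snapshot must account for all of it: `snapshot_total(run())` should equal the initial total even though the local states are recorded at different times. The argument relies on the channels being FIFO; for non-FIFO or causally ordered channels a different mechanism is needed.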

[1]  Richard Koo,et al.  Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.

[2]  Kenneth P. Birman,et al.  Reliable communication in the presence of failures , 1987, TOCS.

[3]  Michel Raynal,et al.  Debugging tool for distributed Estelle programs , 1993, Comput. Commun..

[4]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[5]  Hon Fung Li,et al.  Global State Detection in Non-FIFO Networks , 1987, ICDCS.

[6]  Michel Raynal,et al.  On-the-fly replay: a practical paradigm and its implementation for distributed debugging , 1994, Proceedings of 1994 6th IEEE Symposium on Parallel and Distributed Processing.

[7]  Nancy A. Lynch,et al.  Discarding Obsolete Information in a Replicated Database System , 1987, IEEE Transactions on Software Engineering.

[8]  Madalene Spezialetti,et al.  Efficient Distributed Snapshots , 1986, ICDCS.

[9]  Ten-Hwang Lai,et al.  On Distributed Snapshots , 1987, Inf. Process. Lett..

[10]  Keith Marzullo,et al.  Consistent detection of global predicates , 1991, PADD '91.

[11]  Friedemann Mattern,et al.  Efficient Algorithms for Distributed Snapshots and Global Virtual Time Approximation , 1993, J. Parallel Distributed Comput..

[12]  Michel Raynal,et al.  An introduction to the analysis and debug of distributed computations , 1995, Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing.

[13]  Ozalp Babaoglu,et al.  Consistent global states of distributed systems: fundamental concepts and mechanisms , 1993.

[14]  S. Venkatesan,et al.  Message-optimal incremental snapshots , 1989, [1989] Proceedings. The 9th International Conference on Distributed Computing Systems.

[15]  Jong-Deok Choi,et al.  Breakpoints and halting in distributed programs , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[16]  Madalene Spezialetti,et al.  Simultaneous regions: a framework for the consistent monitoring of distributed systems , 1989, [1989] Proceedings. The 9th International Conference on Distributed Computing Systems.

[17]  Subbarayan Venkatesan,et al.  An Optimal Algorithm for Distributed Snapshots with Causal Message Ordering , 1994, Inf. Process. Lett..

[18]  Michel Raynal,et al.  Specification and Verification of Dynamic Properties in Distributed Computations , 1995, J. Parallel Distributed Comput..

[19]  Ajay D. Kshemkalyani,et al.  Efficient Detection and Resolution of Generalized Distributed Deadlocks , 1994, IEEE Trans. Software Eng..

[20]  B. R. Badrinath,et al.  Recording Distributed Snapshots Based on Causal Order of Message Delivery , 1992, Inf. Process. Lett..

[21]  Jean-Michel Hélary.  Observing Global States of Asynchronous Distributed Applications , 1989, WDAG.