Checkpointing in hybrid distributed systems

To provide fault tolerance to computer systems suffering from transient faults, checkpointing and rollback recovery is widely-used. Among other techniques, two primary checkpointing schemes have been proposed: independent and coordinated schemes. However, most existing work addresses only the need to employ a single checkpointing and rollback recovery scheme to a target system. In this paper, issues are discussed and a new algorithm is developed to address the need of integrating independent and coordinated checkpointing schemes for applications running in a hybrid distributed environment containing multiple heterogeneous subsystems. The required changes to the original checkpointing schemes for each subsystem and the overall prevented unnecessary rollbacks for the integrated system are presented. Also described is an algorithm for collecting garbage checkpoints in the combined hybrid system.

[1]  David L. Presotto,et al.  Publishing: a reliable broadcast communication mechanism , 1983, SOSP '83.

[2]  Jian Xu,et al.  Necessary and Sufficient Conditions for Consistent Global Snapshots , 1995, IEEE Trans. Parallel Distributed Syst..

[3]  Brian Randell,et al.  A formal model of atomicity in asynchronous systems , 1981, Acta Informatica.

[4]  Yin-Min Wang,et al.  Consistent Global checkpoints that Contain a Given Set of Local Chekpoints , 1997, IEEE Trans. Computers.

[5]  D. Manivannan,et al.  Quasi-Synchronous Checkpointing: Models, Characterization, and Classification , 1999, IEEE Trans. Parallel Distributed Syst..

[6]  Augusto Ciuffoletti,et al.  A Distributed Domino-Effect free recovery Algorithm , 1984, Symposium on Reliability in Distributed Software and Database Systems.

[7]  H. Casanova,et al.  ACM SIGACT news distributed computing column 8 , 2002, SIGA.

[8]  Luís Moura Silva,et al.  Global checkpointing for distributed programs , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[9]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[10]  David B. Johnson,et al.  Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing , 1988, J. Algorithms.

[11]  L. Alvisi,et al.  A Survey of Rollback-Recovery Protocols , 2002 .

[12]  Brian Randell,et al.  System structure for software fault tolerance , 1975, IEEE Transactions on Software Engineering.

[13]  J. A. McDermid Checkpointing and Error Recovery in distributed Systems , 1981, ICDCS.

[14]  W. Kent Fuchs,et al.  Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems , 1995, IEEE Trans. Parallel Distributed Syst..

[15]  Vijay K. Garg,et al.  Optimistic recovery in multi-threaded distributed systems , 1999, Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems.

[16]  Bharat K. Bhargava,et al.  Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach , 1988, Proceedings [1988] Seventh Symposium on Reliable Distributed Systems.

[17]  Ian T. Foster,et al.  The Anatomy of the Grid: Enabling Scalable Virtual Organizations , 2001, Int. J. High Perform. Comput. Appl..

[18]  Jiannong Cao,et al.  An abstract model of rollback recovery control in distributed systems , 1992, OPSR.

[19]  Brian Randell System structure for software fault tolerance , 1975 .

[20]  Ten-Hwang Lai,et al.  On Distributed Snapshots , 1987, Inf. Process. Lett..

[21]  Bharat K. Bhargava,et al.  Concurrent robust checkpointing and recovery in distributed systems , 1988, Proceedings. Fourth International Conference on Data Engineering.

[22]  B. Randell,et al.  STATE RESTORATION IN DISTRIBUTED SYSTEMS , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ' Highlights from Twenty-Five Years'..

[23]  RICHARD KOO,et al.  Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.

[24]  Mukesh Singhal,et al.  On Coordinated Checkpointing in Distributed Systems , 1998, IEEE Trans. Parallel Distributed Syst..