Efficient Message Logging for Uncoordinated Checkpointing Protocols

A message is in transit with respect to a global state if its sending is recorded in that global state but its receipt is not. Checkpointing algorithms must log such in-transit messages so that channel states can be restored when a computation is resumed from a consistent global state after a failure. Coordinated checkpointing algorithms log exactly those in-transit messages on stable storage. Uncoordinated checkpointing algorithms, because they lack synchronization, conservatively log more messages.
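The definition of an in-transit message can be illustrated with a minimal sketch. It is not the protocol described in this paper. It only assumes that each local checkpoint records the identifiers of the messages sent and received so far. The names `LocalState`, `in_transit`, and `orphans` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalState:
    """Hypothetical record of one process's local checkpoint."""
    sent: frozenset = frozenset()      # ids of messages whose sending is recorded
    received: frozenset = frozenset()  # ids of messages whose receipt is recorded


def in_transit(global_state):
    """Messages whose sending is recorded in the global state but whose receipt is not."""
    sent = set().union(*(s.sent for s in global_state.values()))
    received = set().union(*(s.received for s in global_state.values()))
    return sent - received


def orphans(global_state):
    """Messages whose receipt is recorded but whose sending is not.

    A global state is consistent only if this set is empty.
    """
    sent = set().union(*(s.sent for s in global_state.values()))
    received = set().union(*(s.received for s in global_state.values()))
    return received - sent


# Assumed example: p1 sent m1 and m2, p2 sent m3 and has received only m1.
global_state = {
    "p1": LocalState(sent=frozenset({"m1", "m2"})),
    "p2": LocalState(sent=frozenset({"m3"}), received=frozenset({"m1"})),
}

messages_to_log = in_transit(global_state)
```

In this assumed example the global state is consistent (no orphan messages), and `m2` and `m3` are the messages that would have to be logged to restore the channel states on recovery.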
