A new, efficient coordinated checkpointing protocol combined with selective sender-based message logging

Checkpointing and message logging are popular, general-purpose tools for providing fault tolerance in distributed systems. Most coordinated checkpointing algorithms in the literature do not address the treatment of lost messages, and they suffer from high output commit latency. To overcome these limitations, we propose a new coordinated checkpointing protocol combined with selective sender-based message logging. The protocol is free from the problem of lost messages. The term 'selective' means that messages are logged only within a specified interval, known as the active interval, which reduces message logging overhead. All processes take checkpoints at the end of their respective active intervals, and these checkpoints together form a consistent global state. Outside the active interval, process state is not checkpointed. The protocol minimizes several overheads: checkpointing overhead, message logging overhead, recovery overhead, and blocking overhead. Compared with blocking coordinated checkpointing, the proposed protocol also incurs less disk contention.
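The selective logging rule described above can be illustrated with a small sketch. It is not the protocol itself: the abstract does not specify how active intervals are started or coordinated, so the `Process` class, its method names, and the in-memory `inbox` are hypothetical and chosen only to show that a sender logs a message when, and only when, it is inside its active interval, and that it checkpoints when the interval ends.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process model; all names here are illustrative assumptions."""
    pid: int
    state: dict = field(default_factory=dict)
    active: bool = False                             # True only inside the active interval
    sender_log: list = field(default_factory=list)   # messages logged at the sender
    checkpoints: list = field(default_factory=list)  # saved copies of process state
    inbox: list = field(default_factory=list)        # stand-in for the network

    def begin_active_interval(self):
        self.active = True

    def send(self, dest, payload):
        # Selective sender-based logging: log only inside the active interval.
        if self.active:
            self.sender_log.append((dest.pid, payload))
        dest.inbox.append((self.pid, payload))

    def end_active_interval(self):
        # The checkpoint is taken at the end of the active interval.
        self.checkpoints.append(copy.deepcopy(self.state))
        self.active = False


p, q = Process(0), Process(1)
p.send(q, "m1")            # outside the active interval: delivered, not logged
p.begin_active_interval()
p.send(q, "m2")            # inside the active interval: logged at the sender
p.state["x"] = 1
p.end_active_interval()    # checkpoint taken here
p.send(q, "m3")            # outside again: delivered, not logged
```

In this sketch only `m2` ends up in `p.sender_log`, which is the source of the reduced logging overhead the abstract claims. Coordination among processes, so that their checkpoints form a consistent global state, is deliberately left out.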
