An adaptive checkpointing protocol to bound recovery time with message logging

Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval for minimizing total execution time of an application in the presence of failures. These solutions are often not applicable due to the lack of accurate data on the probability distribution of failures. Most current checkpoint libraries require application users to define a fixed time interval for checkpointing. The checkpoint interval usually implies the approximate maximum recovery time for single process applications. However, actual recovery time can be much smaller when message logging is used. Due to this faster recovery, checkpointing may be more frequent than needed and thus unnecessary execution overhead is introduced. In this paper, an adaptive checkpointing protocol is developed to accurately enforce the user-defined recovery time and to reduce excessive checkpoints. An adaptive protocol has been implemented and evaluated using a receiver-based message logging algorithm on wired and wireless mobile networks. The results show that the protocol precisely maintains the user-defined maximum recovery times for several traces with varying message exchange rates. The mechanism incurs lour overhead, avoids unnecessary checkpointing, and reduces failure free execution time.

[1]  Jian Xu,et al.  Adaptive independent checkpointing for reducing rollback propagation , 1993, Proceedings of 1993 5th IEEE Symposium on Parallel and Distributed Processing.

[2]  Nuno Neves,et al.  Using time to improve the performance of coordinated checkpointing , 1996, Proceedings of IEEE International Computer Performance and Dependability Symposium.

[3]  Luís Moura Silva,et al.  System-level versus user-defined checkpointing , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[4]  Nitin H. Vaidya,et al.  Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme , 1997, IEEE Trans. Computers.

[5]  W. Kent Fuchs,et al.  PREACHES-portable recovery and checkpointing in heterogeneous systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[6]  Erik Seligman,et al.  Application Level Fault Tolerance in Heterogenous Networks of Workstations , 1997, J. Parallel Distributed Comput..

[7]  Erol Gelenbe,et al.  On the Optimum Checkpoint Interval , 1979, JACM.

[8]  Jehoshua Bruck,et al.  An On-Line Algorithm for Checkpoint Placement , 1997, IEEE Trans. Computers.

[9]  Kai Li,et al.  Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.

[10]  James S. Plank,et al.  Experimental assessment of workstation failures and their impact on checkpointing systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[11]  Lorenzo Alvisi,et al.  Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[12]  Victor F. Nicola,et al.  Comparative Analysis of Different Models of Checkpointing and Recovery , 1990, IEEE Trans. Software Eng..

[13]  Jian Xu,et al.  Necessary and Sufficient Conditions for Consistent Global Snapshots , 1995, IEEE Trans. Parallel Distributed Syst..

[14]  Mukesh Singhal,et al.  Using logging and asynchronous checkpointing to implement recoverable distributed shared memory , 1993, Proceedings of 1993 IEEE 12th Symposium on Reliable Distributed Systems.

[15]  W. Kent Fuchs,et al.  Branch recovery with compiler-assisted multiple instruction retry , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.

[16]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.

[17]  Yi-Min Wang,et al.  Checkpointing and its applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[18]  Golden G. Richard,et al.  On patterns for practical fault tolerant software in Java , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[19]  Jacques Malenfant,et al.  Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems , 1988, IEEE Trans. Computers.

[20]  W. Kent Fuchs,et al.  Message logging in mobile computing , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[21]  Edward G. Coffman,et al.  Optimal strategies for scheduling checkpoints and preventive maintenance , 1990 .

[22]  Nian-Feng Tzeng,et al.  Logging and recovery in adaptive software distributed shared memory systems , 1999, Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems.

[23]  Nuno Neves,et al.  Coordinated checkpointing without direct coordination , 1998, Proceedings. IEEE International Computer Performance and Dependability Symposium. IPDS'98 (Cat. No.98TB100248).

[24]  W. Kent Fuchs,et al.  Compiler‐assisted full checkpointing , 1994, Softw. Pract. Exp..

[25]  W. Kent Fuchs,et al.  Reduced overhead logging for rollback recovery in distributed shared memory , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[26]  Robert Geist,et al.  Selection of a checkpoint interval in a critical-task environment , 1988 .

[27]  Andrzej Duda,et al.  The Effects of Checkpointing on Program Execution Time , 1983, Inf. Process. Lett..

[28]  Özalp Babaoglu,et al.  On the Optimum Checkpoint Selection Problem , 1984, SIAM J. Comput..

[29]  Jehoshua Bruck,et al.  An on-line algorithm for checkpoint placement , 1996, Proceedings of ISSRE '96: 7th International Symposium on Software Reliability Engineering.

[30]  Mukesh Singhal,et al.  Low-cost checkpointing with mutable checkpoints in mobile computing systems , 1998, Proceedings. 18th International Conference on Distributed Computing Systems (Cat. No.98CB36183).

[31]  William H. Sanders,et al.  Performance analysis of two time-based coordinated checkpointing protocols , 1997, Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems.

[32]  Jacob A. Abraham,et al.  Compiler-assisted static checkpoint insertion , 1992, [1992] Digest of Papers. FTCS-22: The Twenty-Second International Symposium on Fault-Tolerant Computing.

[33]  Micah Beck,et al.  Compiler-Assisted Checkpointing , 1994 .

[34]  Lorenzo Alvisi,et al.  An analysis of communication induced checkpointing , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[35]  B. Ramkumar,et al.  Portable checkpointing for heterogeneous architectures , 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.

[36]  Nuno Neves,et al.  RENEW: a tool for fast and efficient implementation of checkpoint protocols , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).