Numerical computation algorithms for sequential checkpoint placement

This paper concerns sequential checkpoint placement problems under two dependability measures: steady-state system availability and expected reward per unit time in the steady state. We develop numerical algorithms, based on the classical Brender's fixed-point algorithm, to determine the optimal checkpoint sequence, and further give three simple approximation methods. Numerical examples with the Weibull failure time distribution quantify the overestimation and underestimation exhibited by the sub-optimal checkpoint sequences obtained from the approximation methods.
