A Variational Calculus Approach to Optimal Checkpoint Placement

Checkpointing is an effective fault-tolerance technique for improving system availability and reliability. However, blind checkpoint placement can result in either performance degradation or expensive recovery cost. By means of the calculus of variations, we derive an explicit formula that links the optimal checkpointing frequency to a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery. The theoretical result shows that the optimal checkpointing frequency is proportional to the square root of the failure rate and is uniquely determined by the failure rate (time-varying or constant) if the recovery function is strictly increasing and the failure rate satisfies λ(∞) > 0. J.L. Bruno and E.G. Coffman (1997) suggest that optimal checkpointing is by its nature a function of the system failure rate, i.e., a time-varying failure rate demands time-varying checkpointing in order to meet a given optimality criterion. The results obtained in this paper agree with their viewpoint.
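The square-root relationship can be illustrated with a minimal sketch. The abstract states only the proportionality, not the constant, so the code below assumes the constant implied by Young's first-order approximation [15] for a constant failure rate (interval = sqrt(2C/λ), hence frequency = sqrt(λ/(2C))) and applies it pointwise to a time-varying rate. The function name `checkpoint_times`, its parameters, and the rule of placing a checkpoint whenever the integrated frequency crosses an integer are illustrative assumptions, not the paper's formula.

```python
import math

def checkpoint_times(failure_rate, cost, horizon, dt=1e-3):
    """Place checkpoints on [0, horizon] from a square-root frequency rule.

    failure_rate: callable t -> lambda(t), failures per unit time (assumed > 0)
    cost:         time overhead of one checkpoint (assumed constant)
    The frequency n(t) = sqrt(lambda(t) / (2 * cost)) is an assumption
    borrowed from Young's constant-rate approximation.
    """
    times = []
    cumulative = 0.0  # running integral of n(t)
    steps = int(round(horizon / dt))
    for i in range(steps):
        t_mid = (i + 0.5) * dt  # midpoint rule for the integral
        cumulative += math.sqrt(failure_rate(t_mid) / (2.0 * cost)) * dt
        # Emit one checkpoint each time the integral passes the next integer.
        while cumulative >= len(times) + 1:
            times.append((i + 1) * dt)
    return times

# Constant rate: evenly spaced checkpoints, about sqrt(2 * cost / rate) apart.
constant = checkpoint_times(lambda t: 0.02, cost=1.0, horizon=100.0, dt=0.01)

# Increasing rate: checkpoints become denser as failures become more likely.
increasing = checkpoint_times(lambda t: 0.02 * (1.0 + t / 10.0),
                              cost=1.0, horizon=100.0, dt=0.01)
```

With a constant rate the spacing stays fixed, while an increasing rate shortens the gaps between successive checkpoints, which is the qualitative behavior the abstract describes.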

[1]  Kai Li, et al. Diskless Checkpointing, 1998, IEEE Trans. Parallel Distributed Syst.

[2]  Vincenzo Grassi, et al. On the Optimal Checkpointing of Critical Tasks and Transaction-Oriented Systems, 1992, IEEE Trans. Software Eng.

[3]  Edward G. Coffman, et al. Optimal fault-tolerant computing on multiprocessor systems, 1997, Acta Informatica.

[4]  Edward G. Coffman, et al. Scheduling Checks and Saves, 1992, INFORMS J. Comput.

[5]  Edmundo de Souza e Silva, et al. Calculating Cumulative Operational Time Distributions of Repairable Computer Systems, 1986, IEEE Transactions on Computers.

[6]  Sheldon M. Ross, et al. Stochastic Processes, 2018, Gauge Integral Structures for Stochastic Calculus and Quantum Electrodynamics.

[7]  Victor F. Nicola, et al. Checkpointing and the modeling of program execution time, 1994.

[8]  Edward G. Coffman, et al. A Stochastic Checkpoint Optimization Problem, 1993, SIAM J. Comput.

[9]  Andrzej Duda, et al. The Effects of Checkpointing on Program Execution Time, 1983, Inf. Process. Lett.

[10]  Jacques Malenfant, et al. Computing Optimal Checkpointing Strategies for Rollback and Recovery Systems, 1988, IEEE Trans. Computers.

[11]  Jie Mi. Interval estimation of availability of a series system, 1991.

[12]  Kang G. Shin, et al. Optimal Checkpointing of Real-Time Tasks, 1987, IEEE Transactions on Computers.

[13]  K. Mani Chandy, et al. Analytic models for rollback and recovery strategies in data base systems, 1975, IEEE Transactions on Software Engineering.

[14]  Jeffrey C. Lagarias, et al. Processor Shadowing: Maximizing Expected Throughput in Fault-Tolerant Systems, 1999, Math. Oper. Res.

[15]  John W. Young, et al. A first order approximation to the optimum checkpoint interval, 1974, CACM.

[16]  Victor F. Nicola, et al. Comparative Analysis of Different Models of Checkpointing and Recovery, 1990, IEEE Trans. Software Eng.

[17]  Asser N. Tantawi, et al. Reliability of Systems with Limited Repairs, 1987, IEEE Transactions on Reliability.

[18]  Edward G. Coffman, et al. Optimal strategies for scheduling checkpoints and preventive maintenance, 1990.

[19]  Özalp Babaoglu, et al. On the Optimum Checkpoint Selection Problem, 1984, SIAM J. Comput.

[20]  K. Mani Chandy, et al. A Survey of Analytic Models of Rollback and Recovery Strategies, 1975, Computer.

[21]  Ushio Sumita, et al. Analysis of effective service time with age dependent interruptions and its application to optimal rollback policy for database management, 1989, Queueing Syst. Theory Appl.

[22]  Kang G. Shin, et al. Optimization criteria for checkpoint placement, 1984, CACM.

[23]  Clement H. C. Leung, et al. On the Execution of Large Batch Programs in Unreliable Computing Systems, 1984, IEEE Transactions on Software Engineering.