Computing Optimal Checkpointing Policies: A Dynamic Programming Approach

Rollback and recovery is a widely used error recovery technique in database systems. This paper presents a numerical approach for computing optimal checkpointing policies for general rollback and recovery models. The approach is based on Markov renewal programming and allows general failure distributions, random checkpointing durations, and reprocessing-dependent recovery times. The proposed algorithm uses value iteration dynamic programming with spline interpolation of the value and policy functions. The objective is to maximize the average system availability over an infinite time horizon. The algorithm has been implemented successfully, and a numerical illustration is provided.
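To make the idea of value iteration with an interpolated value function concrete, the following is a minimal sketch on a toy model; it is not the paper's algorithm. Everything in it is an assumption chosen for illustration: failures are exponential with rate LAM, the checkpoint duration C is deterministic, recovery takes R0 + R1 times the lost work, piecewise-linear interpolation stands in for splines, and the names interp, q_values, and solve are hypothetical. The state is the unsaved work since the last checkpoint, and the semi-Markov problem is reduced to unit-step form with a standard data transformation for undiscounted problems (rewards divided by expected durations, plus an added self-transition).

```python
import math

# Toy parameters: every value below is an assumption made for illustration.
LAM = 0.1            # rate of an exponential failure law
C = 0.5              # checkpoint duration (deterministic in this sketch)
R0, R1 = 0.5, 0.2    # recovery time after a failure = R0 + R1 * lost work
DT = 0.1             # length of one work slice
X_MAX, N = 10.0, 41  # grid over x = unsaved work since the last checkpoint
STEP = X_MAX / (N - 1)
P_FAIL = 1.0 - math.exp(-LAM * DT)   # failure probability within one slice


def interp(h, x):
    """Piecewise-linear interpolation of grid values h at x, clamped to the grid."""
    pos = min(max(x, 0.0), X_MAX) / STEP
    i = min(int(pos), N - 2)
    w = pos - i
    return (1.0 - w) * h[i] + w * h[i + 1]


def q_values(h, i, eta):
    """Transformed action values (work, checkpoint) at grid node i."""
    x = i * STEP
    tau = DT + P_FAIL * (R0 + R1 * x)          # expected duration of "work"
    reward = (1.0 - P_FAIL) * DT - P_FAIL * x  # work gained minus work lost
    nxt = (1.0 - P_FAIL) * interp(h, x + DT) + P_FAIL * h[0]
    q_work = reward / tau + (eta / tau) * nxt + (1.0 - eta / tau) * h[i]
    # A checkpoint earns nothing, lasts C, and resets the unsaved work to 0.
    q_ckpt = (eta / C) * h[0] + (1.0 - eta / C) * h[i]
    return q_work, q_ckpt


def solve(max_iters=20000, tol=1e-9):
    """Relative value iteration; returns (availability estimate, policy per node)."""
    eta = 0.5 * min(DT, C)   # strictly below every expected duration
    h = [0.0] * N
    gain = 0.0
    for _ in range(max_iters):
        new = [max(q_values(h, i, eta)) for i in range(N)]
        gain = new[0]        # h[0] is kept at 0, so this estimates the gain
        new = [v - gain for v in new]
        done = max(abs(a - b) for a, b in zip(new, h)) < tol
        h = new
        if done:
            break
    policy = []
    for i in range(N):
        q_work, q_ckpt = q_values(h, i, eta)
        policy.append("work" if q_work > q_ckpt else "checkpoint")
    return gain, policy


if __name__ == "__main__":
    gain, policy = solve()
    print("estimated availability:", round(gain, 4))
    print("first grid point where a checkpoint is taken:",
          policy.index("checkpoint") * STEP)
```

In this sketch the gain is the long-run fraction of time spent on work that is eventually saved, which plays the role of availability in the toy model. Because a work slice of length DT lands between grid nodes, the value function has to be interpolated at every update; a spline interpolant could be substituted for interp without changing the rest of the loop.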
