Comprehensive evaluation of aperiodic checkpointing and rejuvenation schemes in operational software system

This paper presents a comprehensive evaluation of aperiodic time-based checkpointing and rejuvenation schemes that maximize the steady-state system availability of an operational software system. We consider two kinds of maintenance policies: checkpointing prior to rejuvenating (CPTR) and rejuvenating prior to checkpointing (RPTC). The two schemes are complementary to each other in how they schedule checkpoints and rejuvenation points. In addition, under a periodic full maintenance operation, we show by applying dynamic programming that an aperiodic checkpointing or rejuvenation scheme is optimal for maximizing the steady-state system availability. In numerical examples, CPTR and RPTC are compared under the same overhead parameters, and the effect of each policy on maximizing the steady-state system availability is investigated.
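To make the dynamic programming idea concrete, the following is a minimal sketch of aperiodic checkpoint placement inside one fixed full-maintenance period. It is not the paper's model: the discrete time grid, the Weibull failure law, the cost proxy (expected lost work up to the first failure plus a per-checkpoint overhead), and all names such as `failure_pmf`, `schedule_cost`, and `optimal_checkpoints` are assumptions chosen only for illustration. The paper itself optimizes steady-state availability, which this proxy does not compute.

```python
import math

def failure_pmf(n_slots, shape=2.0, scale=40.0):
    """Assumed aging model: Weibull failure time discretized on slots 1..n_slots."""
    surv = [math.exp(-((t / scale) ** shape)) for t in range(n_slots + 1)]
    pmf = [0.0] + [surv[t - 1] - surv[t] for t in range(1, n_slots + 1)]
    return pmf, surv

def schedule_cost(points, pmf, surv, overhead):
    """Expected cost of a schedule with checkpoints at `points` in (0, n)."""
    n = len(pmf) - 1
    bounds = [0] + sorted(points) + [n]
    cost = 0.0
    for a, b in zip(bounds, bounds[1:]):
        # Work lost if the first failure hits slot t after the checkpoint at a.
        cost += sum(pmf[t] * (t - a) for t in range(a + 1, b + 1))
        if b < n:
            # Overhead is paid only if the system survives until b.
            cost += overhead * surv[b]
    return cost

def optimal_checkpoints(pmf, surv, overhead):
    """Backward dynamic program over the position of the last checkpoint."""
    n = len(pmf) - 1
    value = [0.0] * (n + 1)   # value[a]: best expected cost from a checkpoint at a
    nxt = [n] * (n + 1)       # nxt[a]: best next checkpoint (n = full maintenance)
    for a in range(n - 1, -1, -1):
        best, best_b, lost = float("inf"), n, 0.0
        for b in range(a + 1, n + 1):
            lost += pmf[b] * (b - a)
            tail = overhead * surv[b] + value[b] if b < n else 0.0
            if lost + tail < best:
                best, best_b = lost + tail, b
        value[a], nxt[a] = best, best_b
    points, a = [], nxt[0]
    while a < n:
        points.append(a)
        a = nxt[a]
    return value[0], points

if __name__ == "__main__":
    pmf, surv = failure_pmf(60)
    cost, points = optimal_checkpoints(pmf, surv, overhead=0.5)
    print("checkpoints:", points)
    print("expected cost:", cost)
```

Because the recursion picks each next checkpoint independently, the returned points need not be equally spaced, which is the sense in which the schedule is aperiodic. A periodic schedule can be scored with `schedule_cost` for comparison under the same assumed overhead.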
