Heterogeneous 1-out-of-N warm standby systems with online checkpointing

Abstract: Checkpointing is common practice in computing applications, where it enables effective system recovery after a failure. A checkpoint saves the data associated with the completed portion of a mission task. When a failure occurs, rollback and data retrieval let the system resume the mission task from the last successful checkpoint rather than from the very beginning of the mission, saving time and cost. This paper models and optimizes 1-out-of-N: G warm standby systems subject to uneven online checkpointing, in which checkpoints can be performed in parallel with execution of the primary mission task to improve the efficiency of computing elements. Both data checkpointing and retrieval take a dynamic amount of time that depends on the amount of work completed. System elements can be heterogeneous in their time-to-failure distributions, performance, and level of readiness to take over the mission task while in warm standby mode. A numerical method is first proposed to evaluate mission performance indices, including mission success probability, expected mission completion time, and expected mission operation cost. Examples demonstrate the influence of the mission deadline and the element resource sharing parameter (i.e., the distribution of CPU time between the checkpointing procedure and the primary mission task) on the mission performance metrics. The optimal checkpoint distribution and optimal element activation sequencing problems are considered for different combinations of optimization objectives and constraints. A co-optimization problem is then addressed, which seeks the optimal combination of checkpoint distribution and element activation sequence. Example optimization solutions illustrate the tradeoff among the three mission requirements (reliability, completion time, and operation cost) for warm standby systems with online checkpoints.
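The rollback idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's numerical method: it ignores checkpoint and retrieval durations, CPU sharing, and standby readiness, and it assumes that every listed checkpoint completed successfully. The function names `resume_point` and `lost_work` and the example checkpoint positions are hypothetical, chosen only to show how unevenly spaced checkpoints determine how much work must be redone after a failure.

```python
from bisect import bisect_right


def resume_point(checkpoints, work_done_at_failure):
    """Return the amount of work preserved by the last checkpoint
    taken at or before the failure (0.0 if none was taken yet).

    `checkpoints` is a sorted list of work amounts at which a
    checkpoint completed (assumption: all of them succeeded).
    """
    idx = bisect_right(checkpoints, work_done_at_failure)
    return checkpoints[idx - 1] if idx > 0 else 0.0


def lost_work(checkpoints, work_done_at_failure):
    """Work that must be redone by the element that takes over."""
    return work_done_at_failure - resume_point(checkpoints, work_done_at_failure)


# Hypothetical uneven checkpoint placement over a task of 100 work units.
uneven = [10, 30, 70]
print(resume_point(uneven, 50))  # resumes from the checkpoint at 30
print(lost_work(uneven, 50))     # 20 units of work are repeated
```

With no checkpoints at all, `resume_point` returns 0.0 and the whole completed portion is lost, which corresponds to restarting the mission from the very beginning.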
