Complexity Analysis of Checkpoint Scheduling with Variable Costs

Today's parallel computing platforms are increasingly large and therefore increasingly subject to failures. Consequently, it is necessary to develop efficient strategies that ensure safe and reliable completion of parallel HPC applications. Checkpointing is one of the most popular and efficient techniques for building fault-tolerant applications in such a context. However, checkpoint operations are costly in terms of time, computation, and network communication, and this overhead degrades the overall performance of the application. In this work, we propose a performance model that formally expresses the checkpoint scheduling problem. This model exhibits the tradeoff between the overhead of the checkpoint operations and the computation lost due to failures. Based on this model, we study the computational complexity of scheduling checkpoints with variable costs for general failure distributions. More precisely, we provide a new computational complexity analysis that makes explicit the relations between the probabilistic failure model, the checkpoint cost, and the computational model. In particular, we prove that the checkpoint scheduling problem is NP-hard even in the simple case of a uniform failure distribution. We also present a dynamic programming scheme for determining the optimal checkpointing times in all the variants of the problem.
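To illustrate the tradeoff between checkpoint overhead and lost computation, the following Python sketch runs a dynamic program on a deliberately simplified toy model. It is not the model or the algorithm of the paper. The assumptions, all introduced here for illustration, are: the work is split into n unit-length steps; at most one failure occurs, striking during step t with probability `fail_prob[t-1]`; a checkpoint may be taken after step j at cost `ckpt_cost[j-1]`; a failure forces re-execution of the steps completed since the last checkpoint; and checkpoint time does not advance the failure clock. The names `expected_loss` and `plan_checkpoints` are hypothetical. Because the candidate checkpoint positions are restricted to a fixed grid of steps, this toy version runs in polynomial time and says nothing about the hardness of the general problem.

```python
from typing import List, Tuple


def expected_loss(fail_prob: List[float], i: int, j: int) -> float:
    """Expected number of completed steps to redo in the segment i+1..j.

    Assumes the last checkpoint was taken after step i (i = 0 is the start).
    A failure during step t wastes the t-1-i steps completed since then.
    """
    return sum(fail_prob[t - 1] * (t - 1 - i) for t in range(i + 1, j + 1))


def plan_checkpoints(fail_prob: List[float],
                     ckpt_cost: List[float]) -> Tuple[float, List[int]]:
    """Minimize total checkpoint cost plus expected lost work (toy model).

    Returns the optimal objective value and the steps after which to
    checkpoint. The end of the last step is treated as a free endpoint.
    """
    n = len(fail_prob)
    if len(ckpt_cost) != n:
        raise ValueError("fail_prob and ckpt_cost must have the same length")

    # best[j]: minimal cost for steps 1..j given a checkpoint after step j.
    best = [float("inf")] * (n + 1)
    prev = [0] * (n + 1)
    best[0] = 0.0

    for j in range(1, n + 1):
        cost_j = 0.0 if j == n else ckpt_cost[j - 1]
        for i in range(j):  # i is the previous checkpoint position
            candidate = best[i] + expected_loss(fail_prob, i, j) + cost_j
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk the predecessor links back from n to recover the schedule.
    plan = []
    j = prev[n]
    while j > 0:
        plan.append(j)
        j = prev[j]
    plan.reverse()
    return best[n], plan


if __name__ == "__main__":
    # Uniform failure distribution over four steps, cheap checkpoints.
    print(plan_checkpoints([0.25] * 4, [0.1] * 4))
    # Same distribution, prohibitively expensive checkpoints.
    print(plan_checkpoints([0.25] * 4, [10.0] * 4))
```

With cheap checkpoints the sketch checkpoints after every intermediate step; with expensive ones it takes none and accepts the expected rework. This direct version costs O(n^3) time, which could be reduced with prefix sums, but the simpler form keeps the recurrence visible.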
