Hybrid full/incremental checkpoint/restart for MPI jobs in HPC environments

As the number of cores in high-performance computing environments keeps increasing, faults are becoming common place. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. We have designed a high-performance hybrid disk-based full/incremental checkpointing technique for MPI tasks to capture only data changed since the last checkpoint. Our implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for cost and savings, benefits due to incremental checkpoints significantly outweigh the loss on restart operations. Experiments in a cluster with the NAS Parallel Benchmark suite and mpiBLAST indicate that savings due to replacing full checkpoints with incremental ones average 16.64 seconds while restore overhead amounts to just 1.17 seconds. These savings increase with the frequency of incremental checkpoints. Overall, our novel hybrid full/incremental checkpointing is superior to prior nonhybrid techniques.

[1]  Andrew Lumsdaine,et al.  The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[2]  Philip S. Yu,et al.  Toward Predictive Failure Management for Distributed Stream Processing Systems , 2008, 2008 The 28th International Conference on Distributed Computing Systems.

[3]  RosenblumMendel,et al.  The design and implementation of a log-structured file system , 1991 .

[4]  Jason Duell,et al.  The design and implementation of Berkeley Lab's linuxcheckpoint/restart , 2005 .

[5]  Laxmikant V. Kale,et al.  Proactive Fault Tolerance in Large Systems , 2004 .

[6]  Yookun Cho,et al.  Space-Efficient Page-Level Incremental Checkpointing , 2006 .

[7]  Scott R. Kohn,et al.  Large scale parallel structured AMR calculations using the SAMRAI framework , 2001, SC.

[8]  Chao Wang,et al.  Scalable, fault tolerant membership for MPI tasks on HPC systems , 2006, ICS '06.

[9]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.

[10]  J. Duell The design and implementation of Berkeley Lab's linux checkpoint/restart , 2005 .

[11]  Yookun Cho,et al.  Adaptive page-level incremental checkpointing based on expected recovery time , 2006, SAC '06.

[12]  Georg Stellner,et al.  CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.

[13]  Laxmikant V. Kalé,et al.  A Fault Tolerance Protocol with Fast Fault Recovery , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[14]  A.M. Wissink,et al.  Large Scale Parallel Structured AMR Calculations Using the SAMRAI Framework , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[15]  A. Lumsdaine,et al.  A Checkpoint and Restart Service Specification for Open MPI , 2006 .

[16]  Anand Sivasubramaniam,et al.  Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.

[17]  Rajeev Thakur,et al.  A Meta-Learning Failure Predictor for Blue Gene/L Systems , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).

[18]  Chao Wang,et al.  A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[19]  Anand Sivasubramaniam,et al.  Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[20]  Stephen L. Scott,et al.  A reliability-aware approach for an optimal checkpoint/restart model in HPC environments , 2007, 2007 IEEE International Conference on Cluster Computing.

[21]  Barton P. Miller,et al.  Process migration in DEMOS/MP , 1983, SOSP '83.

[22]  Willy Zwaenepoel,et al.  Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit , 1992, IEEE Trans. Computers.

[23]  Mendel Rosenblum,et al.  The design and implementation of a log-structured file system , 1991, SOSP '91.

[24]  Laxmikant V. Kalé,et al.  Proactive Fault Tolerance in MPI Applications Via Task Migration , 2006, HiPC.

[25]  Wu-chun Feng,et al.  A Power-Aware Run-Time System for High-Performance Computing , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[26]  George Bosilca,et al.  Analysis of the Component Architecture Overhead in Open MPI , 2005, PVM/MPI.

[27]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[28]  Ruei-Chuan Chang,et al.  Continuous Checkpointing: Joining the Checkpointing with Virtual Memory Paging , 1997, Softw. Pract. Exp..

[29]  Stephen L. Scott,et al.  Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[30]  Christian Engelmann,et al.  Proactive process-level live migration in HPC environments , 2008, HiPC 2008.

[31]  Thomas Hérault,et al.  MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[32]  B. Bouteiller,et al.  MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[33]  John Daly A Model for Predicting the Optimum Checkpoint Interval for Restart Dumps , 2003, International Conference on Computational Science.

[34]  Fred Douglis,et al.  Transparent process migration: Design alternatives and the sprite implementation , 1991, Softw. Pract. Exp..

[35]  Song Jiang,et al.  Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[36]  John T. Daly,et al.  A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..

[37]  Andrew Lumsdaine,et al.  A Component Architecture for LAM/MPI , 2003, PVM/MPI.

[38]  Remzi H. Arpaci-Dusseau,et al.  Architectural Requirements and Scalability of the NAS Parallel Benchmarks , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[39]  Dejan S. Milojicic,et al.  Process migration , 1999, ACM Comput. Surv..