Improving an MPI Application-Level Migration Approach through Checkpoint File Splitting

Traditionally used for load balancing, process migration has been gaining popularity in the fault tolerance context. Recently, checkpoint-based migration has been proposed to implement failure avoidance in MPI applications by proactively migrating processes when the application is notified of an impending failure. However, the main drawback of checkpoint-based migration in these scenarios is its high I/O cost, which can make the approach infeasible if the migration does not complete before the failure occurs. To overcome this issue, this work proposes splitting the checkpoint files of an application-level migration approach into multiple smaller files so that the different phases of the migration operation overlap: checkpoint file writing in the terminating processes, data transfer through the network, and state file reading and restart operations in the newly spawned processes. The proposal has been tested using the MPI NAS Parallel Benchmarks. The experimental results show a significant reduction in migration time.
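The overlap described above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses Python threads and in-memory queues as stand-ins for the file system and the network, and the names `split_state`, `run_stage`, and `migrate` are hypothetical. It only shows the general idea that, once the state is split into smaller pieces, a later phase can start on the first piece while an earlier phase is still producing the rest.

```python
import queue
import threading


def split_state(state: bytes, chunk_size: int):
    """Yield consecutive chunks of the serialized state.

    Each chunk stands in for one of the smaller checkpoint files.
    """
    for offset in range(0, len(state), chunk_size):
        yield state[offset:offset + chunk_size]


def run_stage(work, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Apply `work` to every item from `inbox` and forward the result.

    A `None` item marks the end of the stream and is passed downstream.
    """
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(work(item))


def migrate(state: bytes, chunk_size: int) -> bytes:
    """Push the state through write -> transfer -> read, one chunk at a time."""
    written = queue.Queue()      # chunks written by the terminating process
    transferred = queue.Queue()  # chunks that arrived at the destination
    restored = queue.Queue()     # chunks read by the newly spawned process

    # Assumption: transfer and read are modelled as identity functions.
    stages = [
        threading.Thread(target=run_stage, args=(lambda c: c, written, transferred)),
        threading.Thread(target=run_stage, args=(lambda c: c, transferred, restored)),
    ]
    for stage in stages:
        stage.start()

    # Writing phase: each chunk is handed over as soon as it is produced,
    # so the later stages do not wait for the whole state.
    for chunk in split_state(state, chunk_size):
        written.put(chunk)
    written.put(None)

    # Restart phase: reassemble the chunks in arrival order.
    parts = []
    while (chunk := restored.get()) is not None:
        parts.append(chunk)

    for stage in stages:
        stage.join()
    return b"".join(parts)
```

In this sketch each stage is a single thread reading from a FIFO queue, so chunk order is preserved without extra bookkeeping. With a single large file, by contrast, the transfer could not begin until the write had finished.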
