In-memory application-level checkpoint-based migration for MPI programs

Process migration provides many benefits in parallel environments, including dynamic load balancing, improved data access locality, and fault tolerance. This paper describes an in-memory, application-level, checkpoint-based migration solution for MPI codes that uses the Hierarchical Data Format 5 (HDF5) to write the checkpoint files. The proposed solution has three main features. It is transparent to the user, through the use of CPPC (ComPiler for Portable Checkpointing). It is portable: the application-level approach makes the solution suitable for any MPI implementation and operating system, and the HDF5 file format enables restart on different architectures. It also achieves high performance by saving the checkpoint files to memory instead of to disk, using HDF5 in-memory files. Experimental results show that the in-memory approach significantly reduces the I/O cost of the migration process.
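The core idea of in-memory checkpointing can be illustrated with a minimal sketch. It is not the implementation described in the paper: it uses neither HDF5 nor CPPC, and it relies only on the Python standard library as a stand-in. All function names, and the state being saved (an iteration counter and an array of doubles), are hypothetical. The state is serialized into a memory buffer instead of a disk file, in a fixed byte order so that the image could in principle be restored on a different architecture.

```python
import io
import struct


def write_checkpoint(stream, iteration, values):
    """Serialize the state in a fixed little-endian layout, so the bytes
    can be read back on a machine with a different native byte order."""
    stream.write(struct.pack("<qq", iteration, len(values)))
    stream.write(struct.pack(f"<{len(values)}d", *values))


def read_checkpoint(stream):
    """Rebuild the state from a stream produced by write_checkpoint."""
    iteration, count = struct.unpack("<qq", stream.read(16))
    values = list(struct.unpack(f"<{count}d", stream.read(8 * count)))
    return iteration, values


def checkpoint_to_memory(iteration, values):
    """Write the checkpoint to an in-memory buffer; no disk I/O occurs."""
    buffer = io.BytesIO()
    write_checkpoint(buffer, iteration, values)
    return buffer.getvalue()


def restart_from_memory(image):
    """Restore the state from a checkpoint image held in memory."""
    return read_checkpoint(io.BytesIO(image))


# Hypothetical state of one process at the migration point.
image = checkpoint_to_memory(42, [1.0, 2.5, -3.0])
# In a migration, `image` would be sent to the destination process
# (for example over the network) before being restored there.
restored_iteration, restored_values = restart_from_memory(image)
```

Because `write_checkpoint` accepts any writable binary stream, the same routine could also target a file on disk. Only the destination changes, which mirrors the choice between disk files and in-memory files described above.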
