Reducing the overhead of an MPI application-level migration approach

A solution to reduce memory and I/O overhead in a checkpoint-based migration tool. The proposal splits the checkpoint files into small chunks and overlaps I/O phases. Results show its efficiency, both in terms of memory consumption and I/O times. The reduction in memory usage enables migration of applications with large state files.

Process migration provides many benefits for parallel environments, including dynamic load balancing, data access locality, and fault tolerance. This work proposes a solution that reduces the memory and I/O overhead in an application-level checkpoint-based migration approach. The proposal splits the checkpoint files so that the writing of the state in the terminating processes overlaps with the read and restart operations in the newly spawned processes. It has been tested using the MPI NAS Parallel Benchmarks, showing encouraging results in terms of both memory consumption and I/O migration times.
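The overlap between writing and reading can be illustrated with a minimal sketch. It is not the tool's implementation: two Python threads and a queue stand in for the terminating and newly spawned MPI processes, and all names (`write_chunks`, `read_chunks`, `migrate`, the chunk file naming, the default chunk size) are hypothetical choices made for illustration only.

```python
import os
import queue
import tempfile
import threading


def write_chunks(state: bytes, chunk_size: int, directory: str,
                 ready: queue.Queue) -> None:
    """Terminating side (assumed role): write the state as numbered chunk files."""
    for index, offset in enumerate(range(0, len(state), chunk_size)):
        path = os.path.join(directory, f"chunk_{index:06d}.bin")
        with open(path, "wb") as handle:
            handle.write(state[offset:offset + chunk_size])
        # Announce the chunk only after it has been written and closed.
        ready.put(path)
    # Sentinel: tells the reader that no more chunks will follow.
    ready.put(None)


def read_chunks(ready: queue.Queue) -> bytes:
    """New-process side (assumed role): load each chunk as soon as it is announced."""
    parts = []
    while True:
        path = ready.get()
        if path is None:
            break
        with open(path, "rb") as handle:
            parts.append(handle.read())
        # The chunk file is deleted once its contents have been loaded.
        os.remove(path)
    return b"".join(parts)


def migrate(state: bytes, chunk_size: int = 4096) -> bytes:
    """Run writer and reader concurrently and return the restored state."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ready: queue.Queue = queue.Queue()
    with tempfile.TemporaryDirectory() as directory:
        writer = threading.Thread(
            target=write_chunks, args=(state, chunk_size, directory, ready)
        )
        writer.start()
        restored = read_chunks(ready)  # reading overlaps with ongoing writes
        writer.join()
    return restored
```

Because the reader starts consuming the first chunk while later chunks are still being written, neither side has to wait for a complete checkpoint file, and each chunk can be discarded as soon as it has been loaded. In a real migration the two sides would be separate processes, possibly on different nodes, so the queue would have to be replaced by some form of inter-process notification.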
