An Application Level Approach for Proactive Process Migration in MPI Applications

The running times of large-scale parallel applications in computational science and engineering are usually longer than the mean time between failures (MTBF) of the underlying hardware. Such applications must therefore tolerate hardware failures so that a machine failure does not discard all the computation completed so far. Checkpointing with rollback recovery is a widely used technique for building fault-tolerant applications. However, when a failure occurs, most checkpointing mechanisms require the entire parallel application to restart from the last checkpoint. This reduces efficiency and introduces overhead that could be avoided by migrating only the affected process. Although research has been carried out in this field, the solutions proposed in the literature are commonly tied to specific implementations of the parallel communication APIs or to specific runtime environments. The approach presented in this work extends an application-level checkpointing framework to proactively migrate MPI processes away from processors for which an impending failure has been signaled, without restarting the entire application. The main features of the proposed solution are transparency for the user, achieved through a compiler tool and a runtime library, and portability, since it is not locked into a particular MPI implementation.
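
The difference between restarting the whole application and migrating a single process can be illustrated with a toy model. The sketch below is not the framework described in this work and uses no MPI calls; the names `Rank`, `full_restart`, `proactive_migration`, and `run`, the node labels, and the checkpoint interval are all hypothetical and chosen only for illustration. It also ignores the coordination a real migration needs to keep in-flight messages consistent.

```python
from dataclasses import dataclass


@dataclass
class Rank:
    """Toy stand-in for one MPI process (hypothetical, no real MPI involved)."""
    rank_id: int
    node: str
    iteration: int = 0    # application progress so far
    checkpoint: int = 0   # progress stored in the last checkpoint

    def step(self):
        self.iteration += 1

    def save_checkpoint(self):
        self.checkpoint = self.iteration


def full_restart(ranks, failed_node, spare_node):
    """Reactive recovery: every rank rolls back to its last checkpoint."""
    for r in ranks:
        if r.node == failed_node:
            r.node = spare_node
        r.iteration = r.checkpoint


def proactive_migration(ranks, failing_node, spare_node):
    """Proactive recovery: only ranks on the failing node save state and move."""
    for r in ranks:
        if r.node == failing_node:
            r.save_checkpoint()   # capture current state before the node fails
            r.node = spare_node   # continue from that state on the spare node


def run(strategy, steps_before_failure=7, checkpoint_every=5):
    """Advance four ranks, then apply a recovery strategy to node2."""
    ranks = [Rank(i, f"node{i}") for i in range(4)]
    for it in range(1, steps_before_failure + 1):
        for r in ranks:
            r.step()
            if it % checkpoint_every == 0:
                r.save_checkpoint()
    strategy(ranks, "node2", "spare0")
    return ranks


if __name__ == "__main__":
    for strategy in (full_restart, proactive_migration):
        ranks = run(strategy)
        print(strategy.__name__, [(r.node, r.iteration) for r in ranks])
```

In this model a full restart sends every rank back to the last checkpoint, while the proactive path moves only the rank on the node flagged as failing and leaves the progress of the others untouched.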
