Transparent Redundant Computing with MPI

Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider if the benefits of increased resiliency outweigh the cost of consuming additional resources. We describe a transparent redundancy approach for MPI applications and detail two different implementations that can tolerate a range of failure scenarios, including loss of application processes and connectivity. We compare these two approaches and show performance results from micro-benchmarks that bound worst-case message-passing performance degradation. We also propose several enhancements that could lower the overhead of providing resiliency through redundancy.

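To make the notion of "transparent" redundancy concrete, the sketch below shows one common way such interposition can be done: wrapping MPI calls through the standard PMPI profiling interface so that the application is unaware of the replication. This is a minimal illustration under assumed conventions, not the paper's actual implementation; the `replica_of()` helper, the `num_app_ranks` variable, and the rank-mapping scheme (replica of rank r lives at rank r + num_app_ranks) are hypothetical choices introduced purely for this example.

```c
/* Minimal sketch of transparent redundancy via the MPI profiling (PMPI)
 * interface. Assumption: half of MPI_COMM_WORLD acts as replicas, with the
 * replica of application rank r placed at rank r + num_app_ranks. This is
 * illustrative only and does not reflect the implementations described in
 * the paper. */
#include <mpi.h>

static int num_app_ranks;   /* number of logical application ranks (assumption) */

/* Hypothetical mapping from an application rank to its replica rank. */
static int replica_of(int rank) { return rank + num_app_ranks; }

int MPI_Init(int *argc, char ***argv)
{
    int err = PMPI_Init(argc, argv);
    int world_size;
    PMPI_Comm_size(MPI_COMM_WORLD, &world_size);
    num_app_ranks = world_size / 2;   /* assume a 2x allocation of ranks */
    return err;
}

/* Intercept MPI_Send: deliver the message to the destination and,
 * transparently to the application, to the destination's replica as well. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int err = PMPI_Send(buf, count, type, dest, tag, comm);
    if (err == MPI_SUCCESS)
        err = PMPI_Send(buf, count, type, replica_of(dest), tag, comm);
    return err;
}
```

Because the wrappers sit at the profiling layer, an unmodified application link-edited against such a library would issue ordinary MPI calls while each message is duplicated to the replica set; the extra send per message is one source of the worst-case degradation the micro-benchmarks above are meant to bound.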