Fault-tolerant solutions for a MPI compute intensive application

The running times of large-scale computational science and engineering parallel applications, executed on clusters or grid platforms, are usually longer than the mean time between failures (MTBF). Parallel applications must therefore tolerate hardware failures so that a machine failure does not destroy all the computation done so far. Checkpointing and rollback recovery is a very useful technique for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault-tolerance capabilities to their applications. This work presents two different approaches to adding fault tolerance to the MPI version of an air quality simulation. A segment-level solution has been implemented by extending a checkpointing library for sequential codes. A variable-level solution has been implemented manually in the code. The main differences between the two approaches are portability, transparency level, and checkpointing overhead. Experimental results comparing both strategies on a cluster of PCs are shown in the paper.
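
As a rough illustration of the variable-level idea mentioned above, the following minimal Python sketch saves only the variables needed to resume an iterative computation and reloads them on restart. It is not the implementation described in the paper: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the toy update rule, and the file layout are all hypothetical, and the sketch covers a single process only. In an MPI program each process would presumably write its own file and the processes would need to agree on a consistent checkpoint, which is not shown here.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, step, state):
    """Store only the variables needed to resume (hypothetical layout)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename it over the old checkpoint,
    # so a crash during the write never leaves a truncated checkpoint behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state), or (0, None) when no checkpoint exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["step"], data["state"]


def run(path, total_steps, interval, fail_at=None):
    """Toy time-stepping loop with periodic checkpoints and rollback on restart.

    fail_at is an illustrative hook that simulates a machine failure.
    """
    step, state = load_checkpoint(path)
    if state is None:
        state = [0.0] * 4  # fresh start: initial condition of the toy model
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        state = [x + 1.0 for x in state]  # stand-in for one simulation step
        step += 1
        if step % interval == 0:
            save_checkpoint(path, step, state)
    return state
```

Calling `run` again with the same checkpoint path after a failure resumes from the last saved step instead of from step 0, so only the work done since that checkpoint is repeated. The checkpoint interval trades the overhead of writing state against the amount of computation lost on a failure.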
