Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments, with different approaches relying on a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing, for which the literature proposes two different concepts: blocking and non-blocking. However, the two have never been compared quantitatively, and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented both approaches within the MPICH environment and evaluated their performance using the NAS Parallel Benchmarks.
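To make the distinction concrete, the sketch below illustrates the structure of the blocking concept in MPI-style C. It is a minimal illustration under stated assumptions, not the paper's MPICH implementation; in particular, take_local_checkpoint() is a hypothetical stand-in for a routine that snapshots local process state (e.g., via a system-level checkpointer such as BLCR).

/* Sketch of the blocking coordinated-checkpointing idea.
 * Hypothetical illustration only; not the paper's MPICH code. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in: save this process's state to stable storage. */
static void take_local_checkpoint(int rank)
{
    printf("rank %d: local checkpoint taken\n", rank);
}

/* Blocking protocol: every process stops communicating, in-flight
 * messages are drained, all processes checkpoint, then all resume.
 * The snapshot is consistent because no application message crosses
 * the checkpoint line. */
static void blocking_checkpoint(MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* 1. Synchronize; a real implementation would also drain any
     *    pending messages from the network before this point. */
    MPI_Barrier(comm);

    /* 2. Checkpoint while no application message is in flight. */
    take_local_checkpoint(rank);

    /* 3. No process resumes communication until all have
     *    checkpointed. */
    MPI_Barrier(comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    blocking_checkpoint(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

In the non-blocking (Chandy-Lamport style) concept, the barriers disappear: a marker message circulates instead, each process checkpoints when it first receives the marker, and messages arriving on a channel before that channel's marker are logged as part of the snapshot, so the application keeps computing and communicating while the checkpoint is taken.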
