Collective Operations in an Application-level Fault Tolerant MPI System

The running times of many computational science programs are now significantly greater than the mean time between failures (MTBF) of the hardware they run on. Therefore, fault tolerance is becoming a critical issue on high-performance platforms. Checkpointing is a technique for making programs fault tolerant by periodically saving their state and restoring this state after failure. In system-level checkpointing, the state of the entire machine is saved periodically on stable storage. This has too much overhead to be practical on high-performance platforms with thousands of processors. In practice, programmers do manual checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In an earlier paper, we presented a distributed checkpoint coordination protocol which handles MPI's point-to-point constructs, and deals with the unique challenges of application-level checkpointing. This protocol is implemented by a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. However, it did not handle collective communication, which is a very important part of MPI. In this paper we extend the protocol to handle MPI's collective communication constructs.
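To make the manual checkpointing pattern described above concrete, the following is a minimal single-process sketch in Python. It is not the coordination protocol of this paper and involves no MPI at all; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the state layout, and the `fail_at` parameter used to simulate a failure are all hypothetical and chosen only for illustration. It shows step (i), saving key variables at chosen points, and step (ii), rebuilding the computation from those values on restart.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the key program variables to disk, replacing the old file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every few steps.

    fail_at is a hypothetical knob that raises an error at a given step
    to stand in for a hardware failure.
    """
    # Step (ii): restore state from the checkpoint if one exists.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        # Step (i): save key variables at a chosen point in the program.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Work done after the last checkpoint is lost on failure and redone on restart. In an MPI program the difficulty lies elsewhere: each process would take such checkpoints at different times, which is why a coordination protocol is needed.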
