Controller/Precompiler for Portable Checkpointing

This paper presents CPPC (Controller/Precompiler for Portable Checkpointing), a checkpointing tool designed for heterogeneous clusters and Grid infrastructures. It achieves portability through portable protocols, portable checkpoint files, and portable code. Checkpointing is user-directed and works at the variable level, which keeps checkpoint files small. Parallel processes checkpoint independently, without runtime coordination or message logging; consistency is achieved at restart time, when the processes negotiate the restart point. A directive-based checkpointing precompiler has also been implemented to reduce the user's effort. CPPC was designed for parallel MPI programs, but it can also be used with sequential ones and, thanks to its highly modular design, can easily be extended to parallel programs written with other message-passing libraries. Experimental results obtained with CPPC on several test applications are presented.
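
The restart-time negotiation can be illustrated with a minimal sketch. It is not CPPC's actual protocol: it assumes that every process numbers its checkpoints identically and that checkpoints with the same index together form a consistent global state. The function name `negotiate_restart_point` and the input layout are hypothetical, chosen only for illustration.

```python
from typing import Dict, Optional, Set


def negotiate_restart_point(available: Dict[int, Set[int]]) -> Optional[int]:
    """Pick the most recent checkpoint index that every process holds.

    `available` maps a process rank to the set of checkpoint indices that
    rank can still read from stable storage (hypothetical layout).
    Returns None when no index is common to all ranks, in which case the
    application would have to start from the beginning.
    """
    if not available:
        return None
    # Only indices present on every rank can yield a consistent restart.
    common = set.intersection(*available.values())
    return max(common) if common else None


# Rank 2 failed before writing checkpoint 3, so all ranks roll back to 2.
available = {0: {1, 2, 3}, 1: {1, 2, 3}, 2: {1, 2}}
restart_index = negotiate_restart_point(available)
print(restart_index)
```

Because agreement is reached only at restart, the processes in this sketch never exchange coordination messages while checkpointing; the cost is that some ranks may discard checkpoints newer than the agreed index.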
