CLIP: A Checkpointing Tool for Message Passing Parallel Programs

Checkpointing is a useful technique for rollback recovery. We present CLIP, a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. Creating an actual tool for checkpointing a complex machine like the Paragon is not easy, because many issues arise that require careful design decisions to be made. We detail what these decisions are, and how they were made in CLIP. We present performance data when checkpointing several long-running parallel applications. These results show that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer with good performance.

[1]  Willy Zwaenepoel,et al.  On the use and implementation of message logging , 1994, Proceedings of IEEE 24th International Symposium on Fault- Tolerant Computing.

[2]  Partha Dasgupta,et al.  CALYPSO: a novel software system for fault-tolerant parallel processing on distributed platforms , 1995, Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing.

[3]  Kai Li,et al.  Libckpt: Transparent Checkpointing under Unix Error Correction: Libckpt: Transparent Checkpointing under Unix , 1995 .

[4]  Ian Foster,et al.  Parallel Spectral Transform Shallow Water Model: a runtime-tunable parallel benchmark code , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[5]  Kai Li,et al.  Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.

[6]  D.A. Reed,et al.  Input/Output Characteristics of Scalable Parallel Applications , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[7]  Jeffrey F. Naughton,et al.  Low-Latency, Concurrent Checkpointing for Parallel Programs , 1994, IEEE Trans. Parallel Distributed Syst..

[8]  BeguelinAdam,et al.  Application Level Fault Tolerance in Heterogeneous Networks of Workstations , 1997 .

[9]  Tzi-cker Chiueh,et al.  Evaluation of checkpoint mechanisms for massively parallel machines , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.

[10]  Yennun Huang,et al.  Software Implemented Fault Tolerance Technologies and Experience , 1993, FTCS.

[11]  George Em Karniadakis,et al.  Unstructured spectral element methods for simulation of turbulent flows , 1995 .

[12]  P. Pierce,et al.  The Paragon implementation of the NX message passing interface , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[13]  Georg Stellner,et al.  CoCheck: checkpointing and process migration for MPI , 1996, Proceedings of International Conference on Parallel Processing.

[14]  Micah Beck,et al.  Compiler-Assisted Memory Exclusion for Fast Checkpointing , 1995 .

[15]  C. R. Landau The checkpoint mechanism in KeyKOS , 1992, [1992] Proceedings of the Second International Workshop on Object Orientation in Operating Systems.

[16]  Jonathan Walpole,et al.  MIST: PVM with Transparent Migration and Checkpointing , 1995 .

[17]  Willy Zwaenepoel,et al.  The performance of consistent checkpointing , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[18]  Stuart I. Feldman,et al.  IGOR: a system for program debugging via reversible execution , 1988, PADD '88.

[19]  I FeldmanStuart,et al.  IGOR: a system for program debugging via reversible execution , 1988 .

[20]  M. Moura Silva,et al.  Checkpointing SPMD applications on transputer networks , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[21]  Mark Russinovich,et al.  Fault-tolerance for off-the-shelf applications and hardware , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[22]  Kai Li,et al.  ickp: a consistent checkpointer for multicomputers , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.