A Survey of Checkpoint/Restart Implementations

In this paper we evaluate candidates for a checkpoint/restart implementation against a common set of requirements. We discuss the overall characteristics of the two main classes of checkpoint systems, library and system, and then give specific examples from existing systems. We present a detailed description of two system implementations. We conclude that no single publicly available implementation meets all requirements for a checkpoint/restart system for Linux clusters.
