User-Level Socket-Based Checkpointing for Distributed and Parallel Computation

We present a preliminary description of DMTCP, a user-level checkpointing package for Linux. Its socket-based approach is a novel method for checkpointing distributed processes. This includes checkpointing of any dynamically created POSIX threads and forked child processes, as well as processes spawned remotely via ssh and other mechanisms. As with all user-level checkpointing, no modification of the kernel is needed, and the application code is not modified. The package also checkpoints signal handlers, ordinary file descriptors, socket descriptors, and certain other types of file descriptors. Each checkpointed process has an associated checkpoint file. Hence, process migration, and even migration of an entire computation to a new cluster, is achieved through the simple expedient of copying checkpoint files to a new host. However, process migration adds the restriction that the source and destination hosts must be homogeneous.
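
The migration step described above (one checkpoint file per process, copied to a new host, subject to a homogeneity requirement) can be illustrated with a minimal sketch. It is not DMTCP code and uses no DMTCP interface. The `HostInfo` fields, the function names, and the choice to model "copying to a new host" as copying into a destination directory (for example a mounted or shared filesystem) are all assumptions made for illustration. In particular, the text does not define what "homogeneous" means, so the equality test below is only one plausible reading.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class HostInfo:
    """Hypothetical host description; these fields are illustrative only."""
    arch: str            # e.g. "x86_64"
    kernel_release: str  # e.g. "2.6.9"
    libc: str            # e.g. "glibc-2.3.4"


def hosts_homogeneous(src: HostInfo, dst: HostInfo) -> bool:
    # Assumption: "homogeneous" is read as all recorded fields being equal.
    return src == dst


def migrate_checkpoints(files, dest_dir, src: HostInfo, dst: HostInfo):
    """Copy one checkpoint file per process into dest_dir.

    Refuses to copy if the two hosts fail the (assumed) homogeneity test.
    Returns the list of destination paths.
    """
    if not hosts_homogeneous(src, dst):
        raise ValueError("source and destination hosts are not homogeneous")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # copy2 preserves file metadata where possible and returns the new path.
    return [Path(shutil.copy2(f, dest)) for f in files]
```

Restarting the copied checkpoints on the destination host is outside the scope of this sketch; it covers only the file transfer and the precondition check.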
