Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems

We present Cruz, a new distributed checkpoint-restart mechanism that requires no modifications to applications, libraries, or the base kernel. Cruz provides comprehensive support for checkpointing and restoring application state, both at user level and within the OS. Our implementation builds on Zap, a process migration mechanism implemented as a Linux kernel module that interposes a thin layer between applications and the OS. In particular, we enable support for networked applications by adding migratable IP and MAC addresses and by checkpointing and restoring socket buffer state, socket options, and TCP state. We leverage this capability to devise a novel method for coordinated checkpoint-restart that is simpler than prior approaches. For instance, it eliminates the need to flush communication channels by exploiting the packet retransmission behavior of TCP and existing OS support for packet filtering. Our experiments show that the overhead of coordinating checkpoint-restart is negligible, demonstrating the scalability of this approach.
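The idea of checkpointing without flushing channels can be illustrated with a small simulation. The sketch below is not Cruz's implementation or its actual protocol; it is a toy model built on stated assumptions. The names `Endpoint`, `freeze`, `snapshot`, `resume`, and `run_demo` are hypothetical, the `filtering` flag stands in for an OS packet-filter drop rule, and the cumulative-ACK sender stands in for TCP retransmission. It shows why packets dropped during a checkpoint do not need to be drained first: they remain in the sender's saved unacknowledged buffer and are retransmitted after restart.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Toy TCP-like endpoint (assumption: cumulative ACKs, in-order delivery)."""
    name: str
    next_seq: int = 0                              # next sequence number to assign
    unacked: dict = field(default_factory=dict)    # seq -> payload awaiting an ACK
    expected: int = 0                              # next in-order seq accepted
    delivered: list = field(default_factory=list)  # payloads handed to the app
    filtering: bool = False                        # stand-in for a packet-filter drop rule

    def send(self, payload):
        """Queue a payload; it stays buffered until acknowledged."""
        self.unacked[self.next_seq] = payload
        self.next_seq += 1

    def receive(self, seq, payload):
        """Return a cumulative ACK, or None if the filter drops the packet."""
        if self.filtering:
            return None
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected

    def transmit(self, peer):
        """(Re)transmit every unacknowledged segment to the peer."""
        if self.filtering:
            return  # outbound packets are dropped as well
        highest_ack = None
        for seq, payload in sorted(self.unacked.items()):
            ack = peer.receive(seq, payload)
            if ack is not None:
                highest_ack = ack
        if highest_ack is not None:
            self.unacked = {s: p for s, p in self.unacked.items()
                            if s >= highest_ack}


def freeze(nodes):
    """Phase 1 (assumed): every node starts dropping packets."""
    for n in nodes:
        n.filtering = True


def snapshot(nodes):
    """Phase 2 (assumed): each node saves its local state, sockets included."""
    return {n.name: copy.deepcopy(n) for n in nodes}


def resume(nodes):
    """Phase 3 (assumed): filters are removed and traffic flows again."""
    for n in nodes:
        n.filtering = False


def run_demo():
    a, b = Endpoint("a"), Endpoint("b")
    a.send("m0")
    a.transmit(b)              # m0 is delivered and acknowledged
    a.send("m1")
    a.send("m2")               # m1 and m2 are still unacknowledged

    freeze([a, b])
    a.transmit(b)              # dropped by the filter; nothing is flushed
    images = snapshot([a, b])
    resume([a, b])

    # Simulate a failure followed by a restart from the saved images.
    a2, b2 = copy.deepcopy(images["a"]), copy.deepcopy(images["b"])
    resume([a2, b2])
    a2.transmit(b2)            # retransmission recovers m1 and m2
    return b2.delivered, a2.unacked
```

In this model the checkpoint never waits for in-flight data. Consistency comes from saving the sender's unacknowledged buffer together with the receiver's expected sequence number, so each message is delivered exactly once after restart.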
