Virtual machine based heterogeneous checkpointing

Checkpointing an application is the act of saving the application's state to stable storage during its execution, so that if the application fails it can be restarted from the last saved state, thereby avoiding loss of the work already done. A heterogeneous checkpoint/restart mechanism allows an application to be restarted from a saved state that was taken on a hardware architecture and/or operating system different from those of the machine on which it is restarted. This paper explores how to construct such a mechanism at the virtual machine level. That is, rather than dumping the entire state of the application process, the mechanism reported here dumps the state of the application with respect to a virtual machine. During restart, the saved state is loaded into a new copy of the virtual machine, which continues running from that point. The heterogeneous checkpoint/restart mechanism reported here was developed for the OCaml variant of ML. The paper reports on the main issues encountered in building such a mechanism and the design choices made, presents performance evaluations, and discusses some lessons and ideas for extending the work to native-code OCaml and to Java Virtual Machines.
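The idea of saving state with respect to a virtual machine, rather than dumping the raw process image, can be illustrated with a toy example. The sketch below is not the OCaml mechanism described in the paper: the stack machine, its instruction set, the `VM` class, and the use of JSON as the checkpoint format are all assumptions made purely for illustration. The point it shows is that a checkpoint holding only VM-level state (program counter, operand stack, variables) in an architecture-neutral encoding can be loaded into a fresh VM instance, which then continues from where the original stopped.

```python
import json

# Hypothetical toy bytecode: computes acc = 1 + 2 + ... + 5.
PROGRAM = [
    ("push", 0), ("store", "acc"),   # 0-1:  acc = 0
    ("push", 1), ("store", "i"),     # 2-3:  i = 1
    ("load", "acc"), ("load", "i"),  # 4-5:  loop body starts here
    ("add", None), ("store", "acc"), # 6-7:  acc = acc + i
    ("load", "i"), ("push", 1),      # 8-9
    ("add", None), ("store", "i"),   # 10-11: i = i + 1
    ("load", "i"), ("push", 5),      # 12-13
    ("jle", 4),                      # 14: if i <= 5 jump to 4
    ("halt", None),                  # 15
]


class VM:
    """A minimal stack machine whose entire state is pc, stack, and vars."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.stack = []
        self.vars = {}
        self.halted = False

    def step(self):
        op, arg = self.program[self.pc]
        self.pc += 1
        if op == "push":
            self.stack.append(arg)
        elif op == "load":
            self.stack.append(self.vars[arg])
        elif op == "store":
            self.vars[arg] = self.stack.pop()
        elif op == "add":
            b = self.stack.pop()
            a = self.stack.pop()
            self.stack.append(a + b)
        elif op == "jle":
            b = self.stack.pop()
            a = self.stack.pop()
            if a <= b:
                self.pc = arg
        elif op == "halt":
            self.halted = True

    def run(self, max_steps=None):
        steps = 0
        while not self.halted and (max_steps is None or steps < max_steps):
            self.step()
            steps += 1

    def checkpoint(self):
        # Only VM-level state is saved, in a machine-independent text format.
        return json.dumps({"pc": self.pc, "stack": self.stack,
                           "vars": self.vars, "halted": self.halted})

    @classmethod
    def restore(cls, program, blob):
        # Load the saved state into a new copy of the VM.
        state = json.loads(blob)
        vm = cls(program)
        vm.pc = state["pc"]
        vm.stack = state["stack"]
        vm.vars = state["vars"]
        vm.halted = state["halted"]
        return vm


# Run part of the program, checkpoint mid-loop, then resume in a fresh VM.
vm = VM(PROGRAM)
vm.run(max_steps=20)
blob = vm.checkpoint()

resumed = VM.restore(PROGRAM, blob)
resumed.run()

# An uninterrupted run for comparison.
reference = VM(PROGRAM)
reference.run()
```

Here the checkpoint is taken while a value is still on the operand stack, and the resumed VM finishes with the same variables as the uninterrupted run. A real virtual machine has far more state to capture (heap, pointers, open files, and so on), which this sketch deliberately ignores.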
