Portable Checkpointing and Recovery in Heterogeneous Environments

Current approaches for checkpointing and recovery assume system homogeneity, where checkpointing and recovery are both performed on the same processor architecture and operating system configuration. Sometimes it is desirable or necessary to recover the failed computation on a different processor architecture, with possibly different byte-ordering and data-alignment specifications. This implies that checkpointing and recovery must be portable. We provide portability by means of a universal checkpoint format that allows object codes to resume execution from a checkpointed state, allowing for fast execution of already compiled code, rather than interpreting or compiling on the fly. This paper describes the system support needed to implement portable checkpoints, and the shadow checkpoint algorithm to checkpoint and recover a sequential process. Experimental results on three different architecture-operating system combinations demonstrate the checkpointing overhead and the cost of recovery.
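
To make the byte-ordering and data-alignment concern concrete, the following is a minimal sketch of a host-independent encoding, not the universal checkpoint format described in the paper. The layout (a 32-bit entry count followed by pairs of a 64-bit integer and a 64-bit float) and the names `save_checkpoint` and `load_checkpoint` are assumptions chosen purely for illustration. The sketch relies on Python's `struct` module, where the `!` prefix selects big-endian byte order with standard sizes and no padding.

```python
import struct

# Hypothetical fixed layout (an assumption, not the paper's format):
# "!" means big-endian, standard sizes, no alignment padding, so the
# bytes do not depend on the host's native byte order or alignment.
_HEADER = "!I"   # number of entries, unsigned 32-bit
_ENTRY = "!qd"   # one signed 64-bit integer and one 64-bit float


def save_checkpoint(entries):
    """Serialize a list of (int, float) pairs into a host-independent byte string."""
    parts = [struct.pack(_HEADER, len(entries))]
    for number, value in entries:
        parts.append(struct.pack(_ENTRY, number, value))
    return b"".join(parts)


def load_checkpoint(blob):
    """Rebuild the list of (int, float) pairs from bytes written by save_checkpoint."""
    (count,) = struct.unpack_from(_HEADER, blob, 0)
    offset = struct.calcsize(_HEADER)
    entry_size = struct.calcsize(_ENTRY)
    entries = []
    for _ in range(count):
        entries.append(struct.unpack_from(_ENTRY, blob, offset))
        offset += entry_size
    return entries
```

Because both functions name the byte order and field sizes explicitly, a byte string written on a little-endian machine decodes to the same values on a big-endian one. Saving and restoring the stack and program state of compiled code, as the paper does, is a much larger problem than this data-encoding step.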
