Metamori: A Library for Incremental File Checkpointing

The advent of cluster computing has driven the development of software mechanisms for reliability on clusters. The prevalent model for such mechanisms is to take a snapshot of an application's state, called a checkpoint, and commit it to stable storage. The checkpoint carries sufficient metadata that, if the application fails, it can be restarted from the checkpoint; this operation is called a restore. To record a process's complete state, both its volatile and its persistent state must be checkpointed. Several libraries exist for checkpointing volatile state. Some of them feature incremental checkpointing, in which each checkpoint records only the changes made since the previous one. Incremental checkpointing is advantageous because, without it, the time taken by each successive checkpoint keeps growing. Moreover, when checkpoints are taken in increments, state can be restored to any previous checkpoint, a vital feature for adaptive applications. This thesis presents Metamori, a user-level incremental checkpointing library for files. It brings the advantages of incremental memory checkpointing to files, providing a low-overhead approach to checkpointing persistent state. The complete state of an application can thus be checkpointed incrementally, whereas in earlier approaches only volatile state was checkpointed incrementally and persistent state had no such facility.
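
The idea of recording only what changed since the previous checkpoint, and of restoring to any earlier checkpoint, can be illustrated with a minimal sketch. It is not Metamori's interface or algorithm: the class `IncrementalFileCheckpointer`, its methods, the fixed block size, and the use of block hashes to detect changes are all assumptions made for illustration. Increments are also kept in memory here rather than committed to stable storage.

```python
import hashlib


class IncrementalFileCheckpointer:
    """Hypothetical block-level incremental checkpointer for a single file."""

    def __init__(self, path, block_size=4096):
        self.path = path
        self.block_size = block_size
        self.digests = {}      # block index -> digest at the last checkpoint
        self.checkpoints = []  # each entry: (file_length, {block index: bytes})

    def checkpoint(self):
        """Record only the blocks that differ from the previous checkpoint."""
        with open(self.path, "rb") as f:
            data = f.read()
        changed = {}
        digests = {}
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).digest()
            digests[index] = digest
            if self.digests.get(index) != digest:
                changed[index] = block  # new or modified block
        self.digests = digests
        self.checkpoints.append((len(data), changed))
        return len(self.checkpoints) - 1  # identifier of this checkpoint

    def restore(self, checkpoint_id):
        """Rebuild the file as of checkpoint_id by replaying the increments."""
        blocks = {}
        length = 0
        for length, changed in self.checkpoints[:checkpoint_id + 1]:
            blocks.update(changed)  # later increments override earlier ones
        count = -(-length // self.block_size)  # ceiling division
        data = b"".join(blocks[i] for i in range(count))[:length]
        with open(self.path, "wb") as f:
            f.write(data)
```

In this sketch, a call to `checkpoint()` after a small edit stores only the affected blocks, and `restore(k)` replays increments 0 through k, so any earlier checkpoint remains reachable.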
