Checkpointing and its applications

The paper describes our experience with the implementation and applications of the Unix checkpointing library libckp, and identifies two concepts that have proven to be the key to making checkpointing a powerful tool. First, including all persistent states, i.e., user files, as part of the process state that can be checkpointed and recovered provides a truly transparent and consistent rollback. Second, excluding part of the persistent state from the process state allows user programs to process future inputs from a desirable state, which leads to interesting new applications of checkpointing. We use real-life examples to demonstrate the use of libckp for bypassing premature software exits, for fast initialization and for memory rejuvenation.<<ETX>>

[1]  Brian Randell,et al.  System structure for software fault tolerance , 1975, IEEE Transactions on Software Engineering.

[2]  Leslie Lamport,et al.  Distributed snapshots: determining global states of distributed systems , 1985, TOCS.

[3]  Algirdas Avizienis,et al.  The N-Version Approach to Fault-Tolerant Software , 1985, IEEE Transactions on Software Engineering.

[4]  RICHARD KOO,et al.  Checkpointing and Rollback-Recovery for Distributed Systems , 1986, IEEE Transactions on Software Engineering.

[5]  Hans-Juergen Boehm,et al.  Garbage collection in an uncooperative environment , 1988, Softw. Pract. Exp..

[6]  R.E. Strom,et al.  A recoverable object store , 1988, [1988] Proceedings of the Twenty-First Annual Hawaii International Conference on System Sciences. Volume II: Software track.

[7]  Kurt Keutzer,et al.  Algorithms and Techniques for VLSI Layout and Synthesis , 1988 .

[8]  Paul Ammann,et al.  Data Diversity: An Approach to Software Fault Tolerance , 1988, IEEE Trans. Computers.

[9]  Fred Douglis,et al.  Transparent process migration: Design alternatives and the sprite implementation , 1991, Softw. Pract. Exp..

[10]  Robert O. Hastings,et al.  Fast detection of memory leaks and access errors , 1991 .

[11]  Stefan Buczacki The Essential Gardener , 1991 .

[12]  Ibrahim N. Hajj,et al.  Diagnosis and Correction of Logic Design Errors in Digital Circuits , 1993, 30th ACM/IEEE Design Automation Conference.

[13]  W. Kent Fuchs,et al.  Lazy checkpoint coordination for bounding rollback propagation , 1992, Proceedings of 1993 IEEE 12th Symposium on Reliable Distributed Systems.

[14]  David G. Korn,et al.  Feature-based portability , 1994 .

[15]  Kai Li,et al.  Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.

[16]  Ibrahim N. Hajj,et al.  Pattern independent maximum current estimation in power and ground buses of CMOS VLSI circuits: Algorithms, signal correlations, and their resolution , 1995, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[17]  Carl Sechen,et al.  Efficient and effective placement for very large circuits , 1995, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[18]  Weitong Chuang,et al.  Timing and area optimization for standard-cell VLSI circuit design , 1995, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..