Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread and specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the 2.6.11 Linux kernel, provides the essential functionality for transparent, highly responsive, and efficient fault tolerance based on full or incremental checkpointing at system level. TICK is completely user-transparent and requires no changes to user code or system libraries. It is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 µs. It supports both incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
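To illustrate the difference between the full and incremental checkpoints mentioned above, here is a minimal user-level sketch. It is not TICK's implementation: TICK runs inside the kernel, and the abstract does not describe its mechanism. This sketch simply compares page contents to find what changed, a technique chosen here for simplicity. The 4096-byte page size and all function names are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes, for illustration only


def split_pages(memory: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Split a flat memory image into fixed-size pages."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


def full_checkpoint(memory: bytes) -> dict[int, bytes]:
    """Save every page, keyed by page index."""
    return dict(enumerate(split_pages(memory)))


def incremental_checkpoint(previous: bytes, current: bytes) -> dict[int, bytes]:
    """Save only the pages that are new or differ from the previous image."""
    old_pages = split_pages(previous)
    new_pages = split_pages(current)
    return {
        i: page
        for i, page in enumerate(new_pages)
        if i >= len(old_pages) or page != old_pages[i]
    }


def restore(base: dict[int, bytes], deltas: list[dict[int, bytes]]) -> bytes:
    """Rebuild an image from a full checkpoint plus increments, oldest first."""
    pages = dict(base)
    for delta in deltas:
        pages.update(delta)
    return b"".join(pages[i] for i in sorted(pages))


# Hypothetical usage: a four-page image in which one byte of page 1 changes.
before = bytes(4 * PAGE_SIZE)
scratch = bytearray(before)
scratch[PAGE_SIZE + 10] = 0xFF
after = bytes(scratch)

base = full_checkpoint(before)
delta = incremental_checkpoint(before, after)
rebuilt = restore(base, [delta])
```

In this example the full checkpoint stores all four pages, while the increment stores only the single modified page. The sketch does not handle an image that shrinks between checkpoints.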
