Paper information - Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

This article describes the motivation, design, and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time- and space-efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature readings from sensors). Preemption can also increase the efficiency of batch scheduling: for instance, it can reduce idle cycles (by allowing shutdown without any queue-draining period, or reallocation of resources to eliminate idle nodes when better-fitting jobs are queued) and reduce the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.

Jason Duell | Paul H. Hargrove
