A Case Study of Incremental and Background Hybrid In-Memory Checkpointing

Future exascale computing systems will have high failure rates due to the sheer number of components present in the system. A classic fault-tolerance technique used in today’s supercomputers is a checkpoint-restart mechanism. However, traditional hard disk-based checkpointing techniques will soon hit the scalability wall. Recently, many emerging non-volatile memory technologies, such as Phase-Change RAM (PCRAM), are becoming available and can replace disks thanks to their superior latency and power characteristics. Previous research has demonstrated that taking checkpoints at multiple levels, referred to as hybrid checkpointing, and employing PCRAM for taking local checkpoints can dramatically reduce checkpoint overhead and has the potential to scale beyond the exascale. In this work, we develop two prototypes to evaluate hybrid checkpointing. We find that, although global checkpointing is slow, by carefully scheduling checkpoint operations, we can hide its overhead using an extra checkpoint copy maintained in the local PCRAM of each node. In addition, as local checkpointing gets faster, taking more frequent checkpoints can help reduce the size of incremental checkpoints. However, in order to benefit from incremental checkpointing, the checkpoint interval has to be less than 10 seconds.
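The relationship between checkpoint interval and incremental checkpoint size can be illustrated with a toy model. The sketch below is not one of the prototypes described in this work; it assumes a 4 KiB page size, a synthetic trace of uniformly random writes, and hypothetical helper names (`dirty_pages`, `incremental_sizes`, `make_trace`). It counts how many distinct pages are dirtied within each interval, which is the amount of data a page-level incremental checkpoint would need to save.

```python
import random

PAGE_SIZE = 4096  # bytes per page (assumed for illustration)


def dirty_pages(writes):
    """Return the set of page numbers touched by a sequence of byte addresses."""
    return {addr // PAGE_SIZE for addr in writes}


def incremental_sizes(write_trace, interval):
    """Split a write trace into windows of `interval` writes and return the
    number of dirty pages an incremental checkpoint would save per window."""
    sizes = []
    for start in range(0, len(write_trace), interval):
        window = write_trace[start:start + interval]
        sizes.append(len(dirty_pages(window)))
    return sizes


def make_trace(n_writes, n_pages, seed=0):
    """Build a synthetic trace of uniformly random byte addresses (an
    assumption; real applications have very different write locality)."""
    rng = random.Random(seed)
    return [rng.randrange(n_pages * PAGE_SIZE) for _ in range(n_writes)]


if __name__ == "__main__":
    n_pages = 1000
    trace = make_trace(n_writes=20000, n_pages=n_pages)
    # Interval is measured in number of writes, standing in for wall-clock time.
    for interval in (500, 2000, 20000):
        sizes = incremental_sizes(trace, interval)
        avg = sum(sizes) / len(sizes)
        print(f"interval={interval:>6} writes: "
              f"avg dirty pages per checkpoint = {avg:.0f} of {n_pages}")
```

In this model a long interval lets the writes cover nearly the whole address space, so an incremental checkpoint degenerates into a full one, while a short interval leaves many pages untouched. The actual threshold at which incremental checkpointing pays off depends on the application's write behavior and is not something this sketch predicts.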
