Distributed Diskless Checkpoint for Large Scale Systems

In high-performance computing (HPC), applications are periodically checkpointed to stable storage to increase the success rate of long executions. Today, the overhead imposed by disk-based checkpointing is about 20% of execution time, and in the coming years it will exceed 50% if the checkpoint frequency rises along with the fault frequency. Diskless checkpointing has been introduced as a solution to avoid the I/O bottleneck of disk-based checkpointing. However, the encoding time, the dedicated resources (the spares), and the memory overhead imposed by diskless checkpointing are significant obstacles to its adoption. In this work, we address these three limitations: 1) we propose a fault-tolerance model able to tolerate the failure of up to 50% of processes with a low checkpointing overhead; 2) our fault-tolerance model works without spare nodes while still guaranteeing high reliability; and 3) we use solid-state drives to significantly increase checkpoint performance and avoid the memory overhead of classic diskless checkpointing.
