Design and modeling of a non-blocking checkpointing system

As the capability and component count of high-performance computing systems increase, the mean time between failures (MTBF) decreases. Applications typically tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising alternative. However, although multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have orders of magnitude larger memory sizes and failure rates. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
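To make the efficiency trade-off concrete, the sketch below applies a generic first-order checkpoint model in the spirit of Young's approximation; it is not the model developed in the paper. It compares a blocking checkpoint, whose full PFS write cost sits on the application's critical path, with a non-blocking checkpoint that charges only a small initiation/interference cost while the write proceeds in the background. All parameter values (MTBF, checkpoint costs) are hypothetical.

```python
# Minimal sketch of a first-order checkpoint efficiency model (not the
# paper's model). Assumptions: failures are independent with mean time
# between failures M; a blocking checkpoint stalls the application for
# C_block seconds; a non-blocking checkpoint stalls it only for a small
# initiation/interference cost C_async while the PFS write overlaps with
# computation. The checkpoint interval follows Young's first-order rule.
import math

def optimal_interval(cost, mtbf):
    """Young's approximation: T_opt = sqrt(2 * C * M)."""
    return math.sqrt(2.0 * cost * mtbf)

def efficiency(cost, mtbf):
    """Fraction of machine time spent on useful work, to first order:
    time lost to checkpoint overhead and to expected rework after a failure."""
    t = optimal_interval(cost, mtbf)
    overhead = cost / (t + cost)           # checkpoint time on the critical path
    rework = (t + cost) / (2.0 * mtbf)     # expected lost work per failure
    return (1.0 - overhead) * (1.0 - rework)

if __name__ == "__main__":
    mtbf = 3600.0       # hypothetical system MTBF: 1 hour
    c_block = 300.0     # hypothetical blocking checkpoint cost to the PFS: 5 min
    c_async = 15.0      # hypothetical critical-path cost when the write is overlapped
    e_block = efficiency(c_block, mtbf)
    e_async = efficiency(c_async, mtbf)
    print(f"blocking efficiency:     {e_block:.2f}")
    print(f"non-blocking efficiency: {e_async:.2f}")
    print(f"relative improvement:    {e_async / e_block:.2f}x")
```

Under these illustrative parameters the non-blocking variant comes out roughly 1.4x more efficient. The paper's own model additionally accounts for multiple checkpoint levels and their distinct costs and failure rates, which this sketch does not.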
