FTI: High performance Fault Tolerance Interface for hybrid systems

Large scientific applications deployed on current petascale systems spend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
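The idea of hiding the encoding time behind a dedicated thread can be illustrated with a minimal sketch. This is not FTI's API or implementation: the names `xor_parity`, `recover`, and `EncoderThread` are hypothetical, and a single XOR parity block stands in for Reed-Solomon encoding. XOR parity tolerates the loss of only one checkpoint per group, whereas Reed-Solomon can tolerate several.

```python
import threading
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks into one parity block.

    Assumption: this is a simplified stand-in for Reed-Solomon encoding.
    """
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_parity(surviving_blocks + [parity])


class EncoderThread(threading.Thread):
    """Hypothetical dedicated thread that encodes off the critical path."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = blocks
        self.parity = None

    def run(self):
        self.parity = xor_parity(self.blocks)


# Local checkpoints of three "nodes" (equal length for simplicity).
checkpoints = [b"node0-state", b"node1-state", b"node2-state"]

# Start encoding in the background while the application keeps computing.
encoder = EncoderThread(checkpoints)
encoder.start()
work = sum(i * i for i in range(1000))  # placeholder for application work
encoder.join()

# Simulate losing node 1 and rebuilding its checkpoint from the others.
lost = 1
survivors = checkpoints[:lost] + checkpoints[lost + 1:]
restored = recover(survivors, encoder.parity)
```

In this sketch the application thread only pays for starting and joining the encoder; the encoding itself overlaps with computation, which is the effect the abstract attributes to the per-node fault-tolerance thread.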
