Paper information - FTI: High performance Fault Tolerance Interface for hybrid systems

Large scientific applications deployed on current petascale systems spend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
