PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

GPGPUs are increasingly being used to as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method that recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC reduces significantly the fault recovery overheads incurred by FullRC, by 73.5% when errors occur earlier during execution and 74.6% when errors occur later on average. In addition, PartialRC also reduces error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens.

[1]  Slo-Li Chu,et al.  OpenCL: Make Ubiquitous Supercomputing Possible , 2010, 2010 IEEE 12th International Conference on High Performance Computing and Communications (HPCC).

[2]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[3]  Eric Roman A Survey of Checkpoint / Restart Implementations , 2002 .

[4]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, ACM Trans. Graph..

[5]  Huiyang Zhou,et al.  Understanding software approaches for GPGPU reliability , 2009, GPGPU-2.

[6]  Elena Dubrova,et al.  Fault Tolerant Design : An Introduction , 2013 .

[7]  Jens H. Krüger,et al.  GPGPU: general purpose computation on graphics hardware , 2004, SIGGRAPH '04.

[8]  Satoshi Matsuoka,et al.  A high-performance fault-tolerant software framework for memory on commodity GPUs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[9]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[10]  C. V. Ramamoorthy,et al.  Rollback and Recovery Strategies for Computer Programs , 1972, IEEE Transactions on Computers.

[11]  Sudhanva Gurumurthi,et al.  Towards Transient Fault Tolerance for Heterogeneous Computing Platforms , 2008 .

[12]  Kevin Skadron,et al.  A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors , 2007, GH '07.

[13]  Axel W. Krings,et al.  Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing , 2009, IEEE Transactions on Dependable and Secure Computing.

[14]  Satoshi Matsuoka,et al.  Software-Based ECC for GPUs , 2011 .

[15]  Rudolf Eigenmann,et al.  OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.

[16]  L. Borucki,et al.  Comparison of accelerated DRAM soft error rates measured at component and system level , 2008, 2008 IEEE International Reliability Physics Symposium.

[17]  Ravishankar K. Iyer,et al.  Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[18]  Niraj K. Jha,et al.  Fault-tolerant computer system design , 1996, IEEE Parallel & Distributed Technology: Systems & Applications.

[19]  Joel S. Emer,et al.  The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.

[20]  Jens H. Krüger,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.

[21]  Massimo Violante,et al.  Software-Implemented Hardware Fault Tolerance , 2010 .