Transient Fault Resilient QR Factorization on GPUs

With their inherent capability to exploit parallelism, GPUs have become a popular platform for data-intensive scientific computing applications. This trend is expected to continue as the number of computations required by scientific applications reaches the petascale and even the exascale range. As the minimum feature size of transistors shrinks with improving process technology, GPUs are becoming more vulnerable to transient faults caused by events such as power fluctuations and alpha particle strikes. We therefore need methods that guarantee correct computation even in the presence of such faults. In this paper, we develop and analyze three fault-tolerant schemes, FC-O, PC-C, and PC-CS, for the block Householder QR algorithm that can handle faults in the streaming processor (SP) core of a GPU. We also present a transient fault injection mechanism for NVIDIA GPUs that can inject faults of varying durations. We show that two of our schemes, PC-C and PC-CS, have good error coverage and relatively low overhead, and can scale reasonably well to the petascale and exascale range.
