Software-Based Hardening Strategies for Neutron Sensitive FFT Algorithms on GPUs

In this paper we assess the neutron sensitivity of Graphics Processing Units (GPUs) executing a Fast Fourier Transform (FFT) algorithm and propose software-based hardening strategies to reduce its failure rate. Our research is motivated by experimental results with an unhardened FFT, which show that most failures corrupt multiple output elements, an effect caused by the algorithm's data dependencies. In addition, the built-in error-correcting code (ECC) introduced a large overhead and proved insufficient to provide high reliability. Experimental results with the hardened algorithm show a failure rate two orders of magnitude lower than that of the original algorithm (one order of magnitude lower than with ECC) and an overhead 64% smaller than that of ECC.
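The link between data dependencies and multiple output errors can be illustrated with a small fault-injection sketch. The code below is not the hardening strategy evaluated in this work, nor a GPU implementation; it is a plain-Python radix-2 FFT, written under the assumption that a single intermediate value is corrupted once. The names `fft_with_fault` and `corrupted_outputs`, and the additive fault model, are hypothetical and chosen only for illustration.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_with_fault(x, fault_stage=None, fault_index=0, fault_delta=1.0):
    """Iterative radix-2 decimation-in-time FFT (hypothetical helper).

    If fault_stage is given, fault_delta is added to one intermediate
    value just before that butterfly stage runs (assumed fault model).
    """
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # Reorder the input into bit-reversed order.
    a = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    for stage in range(bits):
        if stage == fault_stage:
            a[fault_index] += fault_delta  # single injected corruption
        half = 1 << stage
        size = half * 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                u = a[start + k]
                v = w * a[start + k + half]
                a[start + k] = u + v          # butterfly, upper output
                a[start + k + half] = u - v   # butterfly, lower output
    return a


def corrupted_outputs(n, fault_stage):
    """Count output elements that differ from a fault-free run."""
    x = [float(i + 1) for i in range(n)]
    golden = fft_with_fault(x)
    faulty = fft_with_fault(x, fault_stage=fault_stage)
    return sum(abs(g - f) > 1e-9 for g, f in zip(golden, faulty))


if __name__ == "__main__":
    for stage in range(4):
        print(stage, corrupted_outputs(16, stage))
```

In this sketch, each butterfly stage after the corrupted value doubles the number of affected elements. A fault injected early in a 16-point transform therefore reaches every output, while one injected before the last stage reaches only two.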
