Neutron sensitivity and software hardening strategies for matrix multiplication and FFT on graphics processing units

In this paper, we compare the radiation response of GPUs executing matrix multiplication and FFT algorithms. The experimental results show that, for both algorithms, the output is affected by multiple errors in the majority of cases. Architectural and code analyses indicate that these multiple errors are caused by the corruption of shared resources or by dependencies among threads. The experimental data and analytical studies can be used to estimate the expected error rate of GPUs in realistic applications and to design specific, optimized software-based hardening procedures.
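As an illustration of what a software-based check for matrix multiplication could look like, the sketch below compares row and column checksums of an output matrix against values derived from the inputs. It is a minimal, CPU-side example in plain Python under stated assumptions, not the hardening procedure evaluated in the paper; the function names `matmul` and `locate_errors` and the tolerance `tol` are hypothetical.

```python
# Illustrative sketch only: checksum-based checking of C = A x B.
# Names and the tolerance are assumptions, not taken from the paper.

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def locate_errors(a, b, c, tol=1e-9):
    """Return (bad_rows, bad_cols): indices whose checksums do not match."""
    k = len(b)
    # Expected row sums of C: each row of A times the row sums of B.
    b_row_sums = [sum(row) for row in b]
    expected_row = [sum(a[i][p] * b_row_sums[p] for p in range(k))
                    for i in range(len(a))]
    # Expected column sums of C: the column sums of A times each column of B.
    a_col_sums = [sum(a[i][p] for i in range(len(a))) for p in range(k)]
    expected_col = [sum(a_col_sums[p] * b[p][j] for p in range(k))
                    for j in range(len(b[0]))]
    bad_rows = [i for i, row in enumerate(c)
                if abs(sum(row) - expected_row[i]) > tol]
    bad_cols = [j for j in range(len(c[0]))
                if abs(sum(c[i][j] for i in range(len(c))) - expected_col[j]) > tol]
    return bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

single = matmul(a, b)
single[0][1] += 1            # emulate one corrupted output element
multiple = matmul(a, b)
multiple[1][0] += 1          # emulate two corrupted elements in one row
multiple[1][1] += 2
```

With these inputs, a single corrupted element flags exactly one row and one column, so its position can be inferred. Two corrupted elements in the same row flag one row and two columns. A check designed around the single-error case would therefore need to be extended when most faulty outputs contain several errors.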
