Experimental evaluation of thread distribution effects on multiple output errors in GPUs

Graphics Processing Units (GPUs) are highly susceptible to neutron-induced corruption. Experimental results show that, in the majority of cases, a typical application such as matrix multiplication is affected by multiple output errors. In this paper we evaluate how different thread distributions affect the occurrence of multiple output errors. The reported results and the accompanying architecture analysis yield practical programming advice that may increase the reliability of a generic parallel algorithm without introducing any hardware or computational overhead.
