GPUs Reliability Dependence on Degree of Parallelism

A higher degree of parallelism reduces code execution time. However, managing the larger number of parallel processes puts more strain on the scheduler, which has the side effect of increasing the graphics processing unit (GPU) cross section. Extensive neutron irradiation testing, carried out to study how the degree of parallelism affects GPU reliability, confirms this hypothesis. Moreover, the cache distribution imposed by different degrees of parallelism is shown to influence the application error rate. Finally, the Mean Executions Between Failures (MEBF) metric is introduced to evaluate the amount of data computed correctly by the GPU in a practical application.
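The trade-off between shorter execution time and larger cross section can be illustrated with a minimal Python sketch. It assumes the conventional definitions used in radiation testing, which the abstract itself does not spell out: the cross section is the number of observed errors divided by the particle fluence, and Mean Executions Between Failures (MEBF) is taken as the mean time between failures divided by the execution time. The function names and all numeric values are hypothetical and chosen only for illustration; they are not measurements from the study.

```python
def cross_section(errors: int, fluence: float) -> float:
    """Cross section in cm^2: observed errors divided by fluence (neutrons/cm^2)."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return errors / fluence


def mebf(sigma_cm2: float, flux: float, exec_time_s: float) -> float:
    """Mean Executions Between Failures (assumed definition: MTBF / execution time).

    sigma_cm2   -- cross section in cm^2
    flux        -- neutron flux in neutrons/(cm^2 * s)
    exec_time_s -- duration of one execution in seconds
    """
    if sigma_cm2 <= 0 or flux <= 0 or exec_time_s <= 0:
        raise ValueError("all arguments must be positive")
    failure_rate = sigma_cm2 * flux   # expected failures per second
    mtbf_s = 1.0 / failure_rate       # mean time between failures, in seconds
    return mtbf_s / exec_time_s


if __name__ == "__main__":
    flux = 1e3  # hypothetical flux, neutrons/(cm^2 * s)

    # Hypothetical configurations: the more parallel one runs faster
    # but exposes a larger cross section.
    low_dop = mebf(sigma_cm2=1e-9, flux=flux, exec_time_s=2.0)
    high_dop = mebf(sigma_cm2=2e-9, flux=flux, exec_time_s=0.5)

    print(f"low parallelism  MEBF: {low_dop:.3g}")
    print(f"high parallelism MEBF: {high_dop:.3g}")
```

With these made-up numbers, the more parallel configuration completes more correct executions between failures even though its cross section is twice as large, because its execution time drops by a larger factor. Under this assumed definition, a metric of this kind captures both effects at once, whereas the cross section alone does not.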
