GPUburn: A system to test and mitigate GPU hardware failures

Owing to factors such as high transistor density, high frequency, and low voltage, today's processors are more prone to hardware failures than ever. The impact of these failures depends on where the error occurs and on the type of processor. On GPUs, because of the hierarchical structure of the compute units and of work scheduling, a hardware failure affects only part of the application. In this paper, we present a new methodology for characterizing hardware failures on Nvidia GPUs, based on a software micro-benchmarking platform implemented in OpenCL. We also identify which hardware parts of the Tesla architecture are more sensitive to intermittent errors, which usually appear as the processor ages. We obtained these results by running the processors at high temperature to accelerate the aging process. We show that on GPUs the impact of intermittent errors is limited to a localized architecture tile. Finally, we propose a methodology to detect defective units and record their location so that they can be avoided, which ensures program correctness on such architectures and improves the GPU's fault tolerance and lifespan.
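The detect, record, and avoid steps can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the OpenCL platform described in the paper: the compute units are modelled as integer IDs, the faulty unit is simulated in plain Python, and every name (`make_simulated_kernel`, `find_defective_units`, `healthy_units`) is hypothetical. It shows why a known-answer test has to be repeated to catch an intermittent error, and how the recorded failure locations can be used to exclude units from later work.

```python
from collections import Counter


def make_simulated_kernel(faulty_unit, period=2):
    """Build a fake kernel; one unit corrupts its result intermittently.

    Assumption: real hardware is replaced by a Python closure so the
    example stays self-contained and deterministic.
    """
    calls = Counter()

    def kernel(unit_id, data):
        calls[unit_id] += 1
        result = sum(x * x for x in data)
        # The faulty unit flips the lowest result bit on every `period`-th call.
        if unit_id == faulty_unit and calls[unit_id] % period == 0:
            result ^= 1
        return result

    return kernel


def find_defective_units(unit_ids, kernel, data, expected, repeats=5):
    """Run a known-answer test on each unit and count mismatches per unit."""
    failures = Counter()
    for _ in range(repeats):
        for unit in unit_ids:
            if kernel(unit, data) != expected:
                failures[unit] += 1
    return dict(failures)


def healthy_units(unit_ids, failures):
    """Return the units that never produced a wrong result."""
    return [unit for unit in unit_ids if failures.get(unit, 0) == 0]


units = list(range(8))                    # hypothetical unit IDs
data = list(range(16))
expected = sum(x * x for x in data)       # reference result from a trusted path
kernel = make_simulated_kernel(faulty_unit=3)

failures = find_defective_units(units, kernel, data, expected, repeats=6)
usable = healthy_units(units, failures)
print("failures per unit:", failures)
print("units kept for scheduling:", usable)
```

A single test pass would have missed the simulated fault, because the unit answers correctly on its first call; repeating the test is what exposes the intermittent behaviour in this model.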
