Multi-faceted microarchitecture level reliability characterization for NVIDIA and AMD GPUs

State-of-the-art GPU chips are designed to deliver extreme throughput for graphics as well as for data-parallel general purpose computing workloads (GPGPU computing). Unlike computing for graphics, GPGPU computing requires highly reliable operations. Since provisioning for high reliability may affect performance, the design of GPGPU systems requires the vulnerability of GPU workloads to soft-errors to be jointly evaluated with the performance of GPU chips. We present an extended study based on a consolidated workflow for the evaluation of the reliability in correlation with the performance of four GPU architectures and corresponding chips: AMD Southern Islands and NVIDIA G80/GT200/Fermi. We obtained reliability measurements (AVF and FIT) employing both fault injection and ACE-analysis based on microarchitecture-level simulators. Apart from the reliability-only and performance-only measurements, we propose combined metrics for performance and reliability that assist comparisons for the same application among GPU chips of different ISAs and vendors, as well as among benchmarks on the same GPU chip.

[1]  Luigi Carro,et al.  Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[2]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[3]  Stefano Di Carlo,et al.  SIFI: AMD southern islands GPU microarchitectural level fault injector , 2017, 2017 IEEE 23rd International Symposium on On-Line Testing and Robust System Design (IOLTS).

[4]  Alin Achim,et al.  2016 Design, Automation & Test in Europe Conference & Exhibition (DATE 2016) , 2016, DATE 2016.

[5]  David R. Kaeli,et al.  Multi2Sim: A simulation framework for CPU-GPU computing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[6]  Vijay S. Pande,et al.  Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU , 2009, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[7]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[8]  Mehdi Baradaran Tahoori,et al.  Balancing Performance and Reliability in the Memory Hierarchy , 2005, IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005..

[9]  Sanjay J. Patel,et al.  Examining ACE analysis reliability estimates using fault-injection , 2007, ISCA '07.

[10]  Stefano Di Carlo,et al.  RT Level vs. Microarchitecture-Level Reliability Assessment: Case Study on ARM(R) Cortex(R)-A9 CPU , 2017, 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W).

[11]  Stephen W. Keckler,et al.  SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[12]  L. Carro,et al.  Neutron-Induced Soft Errors in Graphic Processing Units , 2012, 2012 IEEE Radiation Effects Data Workshop.

[13]  Stefano Di Carlo,et al.  Microarchitecture level reliability comparison of modern GPU designs: First findings , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[14]  E. Ibe,et al.  Impact of Scaling on Neutron-Induced Soft Error in SRAMs From a 250 nm to a 22 nm Design Rule , 2010, IEEE Transactions on Electron Devices.

[15]  Dimitris Gizopoulos,et al.  Performance-aware reliability assessment of heterogeneous chips , 2017, 2017 IEEE 35th VLSI Test Symposium (VTS).

[16]  Bo Fang,et al.  GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[17]  Arijit Biswas,et al.  Computing architectural vulnerability factors for address-based structures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[18]  Dimitris Gizopoulos,et al.  GUFI: A framework for GPUs reliability assessment , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[19]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.