A dynamic hardware redundancy mechanism for the in-field fault detection in cores of GPGPUs

In the past, reliability was a minor concern in most application fields of General-Purpose Graphics Processing Units (GPGPUs), such as multimedia and gaming. Today, GPGPUs are also used in new domains, such as automotive, where reliability plays a significant role. In this work, we describe a Dynamic Duplication With Comparison (DDWC) mechanism intended to harden the Scalar Processor (SP) units located in the Streaming Multiprocessors (SMs) of a GPGPU. The proposed mechanism targets the permanent faults that may arise inside the SPs. One additional SP unit is included in the system to redundantly compute the same operations as a selected SP. The results of the two units are compared, and any mismatch signals a possible failure. A custom reconfiguration instruction dynamically selects which SP is monitored. Experimental results show that the proposed mechanism introduces a limited area overhead while significantly increasing the in-field fault detection capabilities of the GPGPU. Its flexibility allows selecting the best trade-off between fault detection latency and performance overhead.
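
The duplicate-and-compare idea described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the hardware design evaluated in this work: the class and function names (`ScalarProcessor`, `DdwcStreamingMultiprocessor`, `reconfigure`, `detect_faulty_lanes`), the eight-lane configuration, the 32-bit datapath, and the stuck-at-1 fault model are all hypothetical choices made for illustration. The spare unit is assumed to be fault-free.

```python
from dataclasses import dataclass
from operator import add
from typing import Callable, List, Optional, Tuple


@dataclass
class ScalarProcessor:
    """One SP lane. `stuck_bit` is a hypothetical permanent stuck-at-1
    fault on a single output bit (None means the lane is fault-free)."""
    stuck_bit: Optional[int] = None

    def execute(self, op: Callable[[int, int], int], a: int, b: int) -> int:
        result = op(a, b) & 0xFFFFFFFF      # assumed 32-bit datapath
        if self.stuck_bit is not None:
            result |= 1 << self.stuck_bit   # the faulty bit always reads 1
        return result


class DdwcStreamingMultiprocessor:
    """Hypothetical SM model: N regular lanes plus one spare SP that
    shadows whichever lane is currently selected for monitoring."""

    def __init__(self, lanes: List[ScalarProcessor]):
        self.lanes = lanes
        self.spare = ScalarProcessor()      # assumed fault-free spare unit
        self.monitored = 0

    def reconfigure(self, lane_index: int) -> None:
        """Stand-in for the custom reconfiguration instruction."""
        if not 0 <= lane_index < len(self.lanes):
            raise IndexError("no such lane")
        self.monitored = lane_index

    def issue(self, op: Callable[[int, int], int],
              operands: List[Tuple[int, int]]) -> Tuple[List[int], bool]:
        """Run one instruction on all lanes. The spare repeats the
        monitored lane's operation and a comparator checks the results."""
        results = [sp.execute(op, a, b)
                   for sp, (a, b) in zip(self.lanes, operands)]
        a, b = operands[self.monitored]
        mismatch = self.spare.execute(op, a, b) != results[self.monitored]
        return results, mismatch


def detect_faulty_lanes(sm: DdwcStreamingMultiprocessor) -> List[int]:
    """Monitor each lane in turn and collect the lanes whose result
    disagrees with the spare for this one test instruction."""
    operands = [(i, i) for i in range(len(sm.lanes))]  # i + i is always even
    flagged = []
    for lane in range(len(sm.lanes)):
        sm.reconfigure(lane)
        _, mismatch = sm.issue(add, operands)
        if mismatch:
            flagged.append(lane)
    return flagged


# Eight lanes, with a stuck-at-1 fault on bit 0 of lane 5's output.
lanes = [ScalarProcessor() for _ in range(8)]
lanes[5].stuck_bit = 0
print(detect_faulty_lanes(DdwcStreamingMultiprocessor(lanes)))  # prints [5]
```

In this model the fault in lane 5 is only observed while lane 5 is the monitored one, which is where the trade-off mentioned above comes from: switching the monitored SP more often shortens the time until a faulty lane is checked, but every switch is an extra instruction.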
