Testing permanent faults in pipeline registers of GPGPUs: A multi-kernel approach

Over the last decade, General Purpose Graphics Processing Units (GPGPUs) have been widely employed in highly demanding data processing applications, including multimedia and high-performance computing, thanks to their parallel processing capabilities. These devices are now also considered promising solutions for high-performance safety-critical applications, such as autonomous and semi-autonomous vehicles. Current GPGPUs are designed to meet challenging execution requirements, e.g., performance and power constraints, which forces designers to adopt aggressive technology scaling. However, some implementation technologies are prone to developing faults during the operational life of the device, producing effects and errors that are unacceptable in the safety-critical domain. Hence, effective in-field test solutions are required to guarantee the target reliability levels. In this paper, we propose in-field test solutions based on Software-Based Self-Test (SBST) targeting the control path of the pipeline registers located in the Streaming Multiprocessor (SM) of a GPGPU. We resort to a multiple-kernel approach to detect permanent faults in these register fields. The solutions were designed using NVIDIA CUDA where possible and lower-level constructs elsewhere. Several usage and compilation restrictions are also described. Fault simulation results on an open-source VHDL implementation (FlexGrip) of the NVIDIA G80 architecture are reported, showing the effectiveness and limitations of the approach.
