On the Efficacy of ECC and the Benefits of FinFET Transistor Layout for GPU Reliability

Using error-correcting codes (ECCs) is considered one of the most effective ways to mask the effects of radiation-induced faults in memory and computing devices. Unfortunately, as modern processors grow more complex, they contain a growing amount of hidden logic and memory resources, such as flip-flops in internal pipelines and queues, that cannot be easily protected by ECC. In this paper, we experimentally investigate the efficacy of using ECC to mask neutron-induced faults in modern graphics processing units (GPUs). In our analysis, we consider GPUs fabricated in CMOS and FinFET technologies. We show that changes in transistor technology can be as beneficial as using ECC for reducing silent data corruption rates. Finally, we compare fault-injection results obtained both on internal registers and at the instruction level to better understand the effectiveness of ECC.
