Estimating Silent Data Corruption Rates Using a Two-Level Model

High-performance and safety-critical system architects must accurately evaluate the application-level silent data corruption (SDC) rates of processors to soft errors. Such an evaluation requires error propagation all the way from particle strikes on low-level state up to the program output. Existing approaches that rely on low-level simulations with fault injection cannot evaluate full applications because of their slow speeds, while application-level accelerated fault testing in accelerated particle beams is often impractical. We present a new two-level methodology for application resilience evaluation that overcomes these challenges. The proposed approach decomposes application failure rate estimation into (1) identifying how particle strikes in low-level unprotected state manifest at the architecture-level, and (2) measuring how such architecture-level manifestations propagate to the program output. We demonstrate the effectiveness of this approach on GPU architectures. We also show that using just one of the two steps can overestimate SDC rates and produce different trends---the composition of the two is needed for accurate reliability modeling.

[1]  Arijit Biswas,et al.  A fast and accurate analytical technique to compute the AVF of sequential bits in a processor , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[3]  Luigi Carro,et al.  Understanding GPU errors on large-scale HPC systems and the implications for system design and operation , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[4]  Jacob A. Abraham,et al.  Quantitative evaluation of soft error injection techniques for robust system design , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[5]  Karthik Pattabiraman,et al.  Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[6]  Stephen W. Keckler,et al.  SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[7]  Ravishankar K. Iyer,et al.  An experimental study of soft errors in microprocessors , 2005, IEEE Micro.

[8]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[9]  Franck Cappello,et al.  Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..

[10]  Kevin Skadron,et al.  Real-world design and evaluation of compiler-managed GPU redundant multithreading , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[11]  Sarita V. Adve,et al.  Accurate microarchitecture-level fault modeling for studying hardware faults , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[12]  Ravishankar K. Iyer,et al.  Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation , 1999, IEEE Trans. Software Eng..

[13]  Ravishankar K. Iyer,et al.  Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[14]  Karthikeyan Sankaralingam,et al.  Understanding the impact of gate-level physical reliability effects on whole program execution , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[15]  Seyed Ghassem Miremadi,et al.  A fast, flexible, and easy-to-develop FPGA-based fault injection technique , 2014, Microelectron. Reliab..

[16]  Laura Monroe,et al.  SDC is in the Eye of the Beholder: A Survey and Preliminary Study , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W).

[17]  Arijit Biswas,et al.  Computing architectural vulnerability factors for address-based structures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[18]  Karthik Pattabiraman,et al.  Modeling Soft-Error Propagation in Programs , 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[19]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[20]  David W. Nellans,et al.  Flexible software profiling of GPU architectures , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[21]  Mattan Erez,et al.  Hamartia: A Fast and Accurate Error Injection Framework , 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W).

[22]  Hyeran Jeon,et al.  Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[23]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.

[24]  Ravishankar K. Iyer,et al.  Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[25]  Thiago Santini,et al.  Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units , 2016, IEEE Transactions on Computers.

[26]  Huntington W. Curtis,et al.  Accelerated testing for cosmic soft-error rate , 1996, IBM J. Res. Dev..

[27]  Dong Li,et al.  Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[28]  Michail Maniatakos,et al.  Instruction-Level Impact Analysis of Low-Level Faults in a Modern Microprocessor Controller , 2011, IEEE Transactions on Computers.

[30]  G. Choi,et al.  FOCUS: an experimental environment for validation of fault-tolerant systems-case study of a jet-engine controller , 1989, Proceedings 1989 IEEE International Conference on Computer Design: VLSI in Computers and Processors.

[31]  Bo Fang,et al.  GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).