PEPPA-X: Finding Program Test Inputs to Bound Silent Data Corruption Vulnerability in HPC Applications

Transient hardware faults have become more prevalent as transistors shrink, and they can cause silent data corruptions (SDCs). HPC applications therefore need to be evaluated (e.g., via fault injection) and protected to meet their reliability targets. In such evaluations, the target programs are exercised with a given set of inputs, usually taken from a program benchmark suite. However, these inputs rarely expose the programs' SDC vulnerabilities, which leads to over-optimistic assessments and unexpectedly high failure rates in production. We propose Peppa-X, which efficiently identifies test inputs that estimate the bound of a program's SDC resiliency. Our key insight is that the distribution of SDC sensitivity in a program often remains stationary across the input space, so a sampled distribution can guide the search for SDC-bound inputs. Our evaluation shows that Peppa-X can identify SDC-bound inputs that existing methods cannot find even with 5x more search time.
