Minotaur: Adapting Software Testing Techniques for Hardware Errors

With the end of conventional CMOS scaling, efficient resiliency solutions are needed to address the increased likelihood of hardware errors. Silent data corruptions (SDCs) are especially harmful because they can create unacceptable output without the user's knowledge. Several resiliency analysis techniques have been proposed to identify SDC-causing instructions, but they remain too slow for practical use, sacrifice accuracy to improve analysis speed, or both. We develop Minotaur, a novel toolkit to improve the speed and accuracy of resiliency analysis. The key insight behind Minotaur is that modern resiliency analysis has many conceptual similarities to software testing; therefore, adapting techniques from the rich software testing literature can lead to principled and significant improvements in resiliency analysis. Minotaur identifies and adapts four concepts from software testing: 1) it introduces the concept of input quality criteria for resiliency analysis and identifies PC coverage as a simple but effective criterion; 2) it creates (fast) minimized inputs from (slow) standard benchmark inputs, using the input quality criteria to assess the quality of the created input; 3) it adapts the concept of test case prioritization to prioritize error injections and invoke early termination for a given instruction, speeding up error-injection campaigns; and 4) it further adapts test case (or input) prioritization to accelerate SDC discovery across multiple inputs. We evaluate Minotaur by applying it to Approxilyzer, a state-of-the-art resiliency analysis tool. Minotaur's first three techniques speed up Approxilyzer's resiliency analysis by 10.3X (on average) for the workloads studied. Moreover, they identify 96% (on average) of all SDC-causing instructions explored, compared to 64% identified by Approxilyzer alone. Minotaur's fourth technique (input prioritization) identifies all SDC-causing instructions explored across multiple inputs 2.3X faster (on average) than analyzing each input independently for our workloads.
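The ideas named in the abstract can be illustrated with a small Python sketch. It is not Minotaur's implementation: the function names, the representation of an execution as a set of program counters, the stop-at-first-SDC rule, and the greedy coverage-based ordering are all assumptions made for illustration, since the abstract does not specify these details.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def pc_coverage(reference_pcs: Set[int], candidate_pcs: Set[int]) -> float:
    """Fraction of the reference input's executed PCs that the candidate
    (e.g. minimized) input also executes. Hypothetical quality metric."""
    if not reference_pcs:
        return 1.0
    return len(reference_pcs & candidate_pcs) / len(reference_pcs)


def inject_with_early_termination(
    sites: Iterable[str], is_sdc: Callable[[str], bool]
) -> Tuple[bool, int]:
    """Run error injections for one instruction in the given priority order.

    Assumption: once any injection produces an SDC, the instruction is
    classified as SDC-causing and the remaining injections are skipped.
    Returns (found_sdc, number_of_injections_run).
    """
    runs = 0
    for site in sites:
        runs += 1
        if is_sdc(site):
            return True, runs
    return False, runs


def prioritize_inputs(traces: Dict[str, Set[int]]) -> List[str]:
    """Order inputs greedily by how many not-yet-covered PCs each adds.

    Assumption: 'additional coverage' ordering borrowed from test case
    prioritization; ties are broken alphabetically by input name.
    """
    remaining = dict(traces)
    covered: Set[int] = set()
    order: List[str] = []
    while remaining:
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order


if __name__ == "__main__":
    print(pc_coverage({1, 2, 3, 4}, {1, 2, 3}))
    print(inject_with_early_termination(["a", "b", "c"], lambda s: s == "b"))
    print(prioritize_inputs({"a": {1, 2}, "b": {1, 2, 3}, "c": {4}}))
```

In this sketch a minimized input would be accepted only if its `pc_coverage` against the standard input meets a chosen threshold, and inputs that add the most unexplored PCs would be analyzed first.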
