Aloe: verifying reliability of approximate programs in the presence of recovery mechanisms

Modern hardware is becoming increasingly susceptible to silent data corruptions. As general methods for detection and recovery from errors are time and energy consuming, selective detection and recovery are promising alternatives for applications that have the freedom to produce results with a variable level of accuracy. Several programming languages have provided specialized constructs for expressing detection and recovery operations, but the existing static analyses of safety and quantitative analyses of programs do not have the proper support for such language constructs. This work presents Aloe, a quantitative static analysis of reliability of programs with recovery blocks - a construct that checks for errors, and if necessary, applies the corresponding recovery strategy. The analysis supports reasoning about both reliable and potentially unreliable detection and recovery mechanisms. It implements a novel precondition generator for recovery blocks, built on top of Rely, a state-of-the-art quantitative reliability analysis for imperative programs. Aloe can reason about programs with scalar and array expressions, if-then-else conditionals, and bounded loops without early exits. The analyzed computation is idempotent and the recovery code re-executes the original computation. We implemented Aloe and applied it to a set of eight programs previously used in approximate computing research. Our results present significantly higher reliability and scale better compared to the existing Rely analysis. Moreover, the end-to-end accuracy of the verified computations exhibits only small accuracy losses.

[1]  Somesh Jha,et al.  Static analysis and compiler design for idempotent processing , 2012, PLDI.

[2]  Martin C. Rinard Probabilistic accuracy bounds for fault-tolerant computations that discard tasks , 2006, ICS '06.

[3]  Sarita V. Adve,et al.  Low-cost program-level detectors for reducing silent data corruptions , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[4]  Xin Zhang,et al.  FlexJava: language support for safe and modular approximate programming , 2015, ESEC/SIGSOFT FSE.

[5]  Algirdas Avizienis,et al.  Software Fault Tolerance , 1989, IFIP Congress.

[6]  Hadi Esmaeilzadeh,et al.  AxBench: A Multiplatform Benchmark Suite for Approximate Computing , 2017, IEEE Design & Test.

[7]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.

[8]  Song Liu,et al.  Flikker: saving DRAM refresh-power through critical data partitioning , 2011, ASPLOS XVI.

[9]  Joel R. Sklaroff,et al.  Redundancy Management Technique for Space Shuttle Computers , 1976, IBM J. Res. Dev..

[10]  Michael Carbin,et al.  Leto: verifying application-specific hardware fault tolerance with programmable execution models , 2018, Proc. ACM Program. Lang..

[11]  Dan Grossman,et al.  Probability type inference for flexible approximate programming , 2015, OOPSLA.

[12]  Ravishankar K. Iyer,et al.  An end-to-end approach for the automatic derivation of application-aware error detectors , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[13]  Darko Marinov,et al.  Minotaur: Adapting Software Testing Techniques for Hardware Errors , 2019, ASPLOS.

[14]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[15]  Huiyang Zhou,et al.  Anomaly-based bug prediction, isolation, and validation: an automated approach for software debugging , 2009, ASPLOS.

[16]  Scott A. Mahlke,et al.  Input responsiveness: using canary inputs to dynamically steer approximation , 2016, PLDI.

[17]  M. Rinard Energy-Efficient Approximate Computation in Topaz , 2014 .

[18]  Josep Torrellas,et al.  Replica: A Wireless Manycore for Communication-Intensive and Approximate Data , 2019, ASPLOS.

[19]  Omer Khan,et al.  CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores , 2015, 2015 IEEE International Symposium on Workload Characterization.

[20]  Shuvendu K. Lahiri,et al.  Verifying Relative Safety, Accuracy, and Termination for Program Approximations , 2016, Journal of Automated Reasoning.

[21]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[22]  Rakesh Kumar,et al.  VideoChef: Efficient Approximation for Streaming Video Processing Pipelines , 2018, USENIX Annual Technical Conference.

[23]  Franck Cappello,et al.  Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..

[24]  Martin C. Rinard,et al.  Proving acceptability properties of relaxed nondeterministic approximate programs , 2012, PLDI.

[25]  Martin C. Rinard,et al.  Chisel: reliability- and accuracy-aware optimization of approximate computational kernels , 2014, OOPSLA.

[26]  Karthikeyan Sankaralingam,et al.  Relax: an architectural framework for software recovery of hardware faults , 2010, ISCA.

[27]  Eric Cheng,et al.  CLEAR: Cross-layer exploration for architecting resilience: Combining hardware and software techniques to tolerate soft errors in processor cores , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[28]  Martin C. Rinard,et al.  Verifying quantitative reliability for programs that execute on unreliable hardware , 2013, OOPSLA.

[29]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[30]  Ravishankar K. Iyer,et al.  SymPLFIED: Symbolic program-level fault injection and error detection framework , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[31]  Martin C. Rinard,et al.  Verified integrity properties for safe approximate program transformations , 2013, PEPM '13.

[32]  Woongki Baek,et al.  Green: a framework for supporting energy-conscious programming using controlled approximation , 2010, PLDI '10.

[33]  Sarita V. Adve,et al.  Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[34]  Martin C. Rinard,et al.  Approximate computation with outlier detection in Topaz , 2015, OOPSLA.

[35]  Jinsuk Chung,et al.  Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[36]  Saurabh Bagchi,et al.  Phase-aware optimization in approximate computing , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[37]  Guanpeng Li,et al.  Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[38]  Sasa Misailovic,et al.  Verifying safety and accuracy of approximate parallel programs via canonical sequentialization , 2019, Proc. ACM Program. Lang..

[39]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multi-threading alternatives , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[40]  Todd M. Austin Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation , 2003, SBCCI '06.

[41]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[42]  Oded Goldreich,et al.  Introduction to Property Testing , 2017 .

[43]  Dan Grossman,et al.  EnerJ: approximate data types for safe and general low-power computation , 2011, PLDI '11.