Verifying quantitative reliability for programs that execute on unreliable hardware

Emerging high-performance architectures are anticipated to contain unreliable components that may exhibit soft errors, which silently corrupt the results of computations. Full detection and masking of soft errors is challenging, expensive, and, for some applications, unnecessary. For example, approximate computing applications (such as multimedia processing, machine learning, and big data analytics) can often naturally tolerate soft errors. We present Rely a programming language that enables developers to reason about the quantitative reliability of an application -- namely, the probability that it produces the correct result when executed on unreliable hardware. Rely allows developers to specify the reliability requirements for each value that a function produces. We present a static quantitative reliability analysis that verifies quantitative requirements on the reliability of an application, enabling a developer to perform sound and verified reliability engineering. The analysis takes a Rely program with a reliability specification and a hardware specification that characterizes the reliability of the underlying hardware components and verifies that the program satisfies its reliability specification when executed on the underlying unreliable hardware platform. We demonstrate the application of quantitative reliability analysis on six computations implemented in Rely.

[1]  Corina S. Pasareanu,et al.  Reliability analysis in Symbolic PathFinder , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[2]  Surendra Byna,et al.  Exploiting the forgiving nature of applications for scalable parallel execution , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[3]  Martin C. Rinard,et al.  Acceptability-oriented computing , 2003, OOPSLA '03.

[4]  Dan Grossman,et al.  EnerJ: approximate data types for safe and general low-power computation , 2011, PLDI '11.

[5]  Donald Yeung,et al.  Application-Level Correctness and its Impact on Fault Tolerance , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[6]  Herbert Wiklicky,et al.  Concurrent constraint programming: towards probabilistic abstract interpretation , 2000, PPDP '00.

[7]  Song Liu,et al.  Flikker: saving DRAM refresh-power through critical data partitioning , 2011, ASPLOS XVI.

[8]  Martin C. Rinard,et al.  Proving acceptability properties of relaxed nondeterministic approximate programs , 2012, PLDI.

[9]  Martin C. Rinard,et al.  Automatically identifying critical input regions and code in applications , 2010, ISSTA '10.

[10]  Henry Hoffmann,et al.  Dynamic knobs for responsive power-aware computing , 2011, ASPLOS XVI.

[11]  Henry Hoffmann,et al.  Patterns and statistical analysis for understanding reduced resource computing , 2010, OOPSLA.

[12]  Quinn Jacobson,et al.  ERSA: error resilient system architecture for probabilistic applications , 2010, DATE 2010.

[13]  Manuel Blum,et al.  Self-testing/correcting with applications to numerical problems , 1990, STOC '90.

[14]  Daniel M. Roy,et al.  Probabilistically Accurate Program Transformations , 2011, SAS.

[15]  Michael J. A. Smith Probabilistic Abstract Interpretation of Imperative Programs using Truncated Normal Distributions , 2008, Electron. Notes Theor. Comput. Sci..

[16]  Martin C. Rinard Probabilistic accuracy bounds for fault-tolerant computations that discard tasks , 2006, ICS '06.

[17]  Benjamin Grégoire,et al.  Formal certification of code-based cryptographic proofs , 2009, POPL '09.

[18]  Luis Ceze,et al.  Neural Acceleration for General-Purpose Approximate Programs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[19]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS 2010.

[20]  Martin C. Rinard,et al.  Exploring the acceptability envelope , 2005, OOPSLA '05.

[21]  Neeraj Suri,et al.  On the placement of software mechanisms for detection of data errors , 2002, Proceedings International Conference on Dependable Systems and Networks.

[22]  Alan Edelman,et al.  Language and compiler support for auto-tuning variable-accuracy algorithms , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[23]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[24]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[25]  Françoise Chaitin-Chatelin,et al.  Lectures on finite precision computations , 1996, Software, environments, tools.

[26]  José Nuno Oliveira,et al.  Calculating fault propagation in functional programs , 2013 .

[27]  Michael D. Ernst,et al.  Automatically patching errors in deployed software , 2009, SOSP '09.

[28]  Amir Pnueli,et al.  Translation Validation , 1998, TACAS.

[29]  Douglas L. Jones,et al.  Scalable stochastic processors , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[30]  Gilles Barthe,et al.  Probabilistic relational reasoning for differential privacy , 2012, POPL '12.

[31]  David Walker,et al.  Fault-tolerant typed assembly language , 2007, PLDI '07.

[32]  David Monniaux,et al.  Abstract Interpretation of Probabilistic Semantics , 2000, SAS.

[33]  Daniel M. Roy,et al.  Enhancing Server Availability and Security Through Failure-Oblivious Computing , 2004, OSDI.

[34]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[35]  Vivek Sarkar,et al.  Array SSA form and its use in parallelization , 1998, POPL '98.

[36]  Martin C. Rinard,et al.  Bolt: on-demand infinite loop escape in unmodified binaries , 2012, OOPSLA '12.

[37]  Xiangyu Zhang,et al.  Coalescing executions for fast uncertainty analysis , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[38]  Annabelle McIver,et al.  Probabilistic predicate transformers , 1996, TOPL.

[39]  Martin C. Rinard,et al.  Detecting and Escaping Infinite Loops with Jolt , 2011, ECOOP.

[40]  Krishna V. Palem,et al.  Ultra-Efficient (Embedded) SOC Architectures based on Probabilistic CMOS (PCMOS) Technology , 2006, Proceedings of the Design Automation & Test in Europe Conference.

[41]  Franck Cappello,et al.  Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..

[42]  Karthikeyan Sankaralingam,et al.  Relax: an architectural framework for software recovery of hardware faults , 2010, ISCA.

[43]  Aviral Shrivastava,et al.  Mitigating soft error failures for multimedia applications by selective data protection , 2006, CASES '06.

[44]  Dexter Kozen,et al.  Semantics of probabilistic programs , 1979, 20th Annual Symposium on Foundations of Computer Science (sfcs 1979).

[45]  Henry Hoffmann,et al.  Quality of service profiling , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[46]  João Gabriel Silva,et al.  Algorithm based fault tolerance versus result-checking for matrix computations , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[47]  Martin C. Rinard,et al.  Parallelizing Sequential Programs with Statistical Accuracy Tests , 2013, TECS.

[48]  Henry Hoffmann,et al.  Managing performance vs. accuracy trade-offs with loop perforation , 2011, ESEC/FSE '11.

[49]  Manuel Blum,et al.  Designing programs that check their work , 1989, STOC '89.

[50]  Sumit Gulwani,et al.  Proving programs robust , 2011, ESEC/FSE '11.

[51]  Luis Ceze,et al.  Architecture support for disciplined approximate programming , 2012, ASPLOS XVII.

[52]  Edsger W. Dijkstra,et al.  Guarded commands, nondeterminacy and formal derivation of programs , 1975, Commun. ACM.

[53]  Federico Olmedo,et al.  Probabilistic Reasoning for Differential Privacy , 2012 .

[54]  P. R. Harvey,et al.  Software fault tree analysis , 1983, J. Syst. Softw..

[55]  Krishna V. Palem,et al.  Energy aware computing through probabilistic switching: a study of limits , 2005, IEEE Transactions on Computers.

[56]  David Walker,et al.  Modular Protections against Non-control Data Attacks , 2011, CSF.

[57]  Martin C. Rinard Using early phase termination to eliminate load imbalances at barrier synchronization points , 2007, OOPSLA.

[58]  Patrick Cousot,et al.  Probabilistic Abstract Interpretation , 2012, ESOP.

[59]  Xiangyu Zhang,et al.  White box sampling in uncertain data processing enabled by program analysis , 2012, OOPSLA '12.

[60]  Benjamin C. Pierce,et al.  Distance makes the types grow stronger: a calculus for differential privacy , 2010, ICFP '10.

[61]  Martin C. Rinard,et al.  Verified integrity properties for safe approximate program transformations , 2013, PEPM '13.

[62]  Karthik Pattabiraman,et al.  Samurai: protecting critical data in unsafe languages , 2008, Eurosys '08.

[63]  Woongki Baek,et al.  Green: a framework for supporting energy-conscious programming using controlled approximation , 2010, PLDI '10.

[64]  Alan Edelman,et al.  PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[65]  Karthik Pattabiraman,et al.  Error detector placement for soft computation , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[66]  Nick Benton,et al.  Simple relational correctness proofs for static analyses and program transformations , 2004, POPL.

[67]  Nancy G. Leveson,et al.  The Use of Self Checks and Voting in Software Error Detection: An Empirical Study , 1990, IEEE Trans. Software Eng..

[68]  David Walker,et al.  Reasoning about Control Flow in the Presence of Transient Faults , 2008, SAS.

[69]  Zeyuan Allen Zhu,et al.  Randomized accuracy-aware program transformations for efficient approximate computations , 2012, POPL '12.

[70]  Sumit Gulwani,et al.  Static analysis for probabilistic programs: inferring whole program properties from finitely many paths , 2013, PLDI.

[71]  Gilles Barthe,et al.  A Formally Verified SSA-Based Middle-End - Static Single Assignment Meets CompCert , 2012, ESOP.