Leto: verifying application-specific hardware fault tolerance with programmable execution models

Researchers have recently designed a number of application-specific fault tolerance mechanisms. With these mechanisms, an application is either naturally resilient to errors or includes additional detection and correction steps that bring its overall execution back into an envelope for which an acceptable execution is eventually guaranteed. A major challenge in building an application that leverages these mechanisms, however, is verifying that the implementation satisfies the basic invariants that the mechanisms require, given a model of how faults may manifest during the application's execution. To this end, we present Leto, an SMT-based automatic verification system that enables developers to verify their applications with respect to an execution model specification. Specifically, Leto enables software and platform developers to programmatically specify the execution semantics of the underlying hardware system and to verify assertions about the behavior of the application's resulting execution. In this paper, we present the Leto programming language and its corresponding verification system. We also demonstrate Leto on several applications that leverage application-specific fault tolerance.
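
To make the idea of verifying an application against an execution model concrete, the following Python sketch may help. It is not Leto syntax and does not use an SMT solver. It assumes a toy 4-bit machine whose addition may suffer at most one single-bit flip per execution, and it checks an assertion by brute-force enumeration of every input and every fault the model allows. All names (`fault_schedules`, `unreliable_add`, `dmr_add`, `unprotected_add`, `verify`) are hypothetical and chosen only for illustration.

```python
from itertools import product

WIDTH = 4                 # bit width of the toy machine (an assumption)
MASK = (1 << WIDTH) - 1   # keeps results within WIDTH bits


def fault_schedules(num_ops):
    """Enumerate every behavior the toy execution model allows:
    either no fault, or exactly one bit flip in exactly one operation."""
    yield None
    for op_index in range(num_ops):
        for bit in range(WIDTH):
            yield (op_index, bit)


def unreliable_add(a, b, op_index, schedule):
    """Addition under the model: flip one result bit if the schedule targets this op."""
    result = (a + b) & MASK
    if schedule is not None and schedule[0] == op_index:
        result ^= 1 << schedule[1]
    return result


def dmr_add(a, b, schedule):
    """Detection and correction: compute twice, and recompute reliably on a mismatch."""
    first = unreliable_add(a, b, 0, schedule)
    second = unreliable_add(a, b, 1, schedule)
    if first != second:
        return (a + b) & MASK  # assumed reliable fallback path
    return first


def unprotected_add(a, b, schedule):
    """Baseline with no fault tolerance mechanism."""
    return unreliable_add(a, b, 0, schedule)


def verify(program, num_ops):
    """Check the assertion 'the result equals the fault-free sum' for all inputs
    and all allowed faults. Returns a counterexample, or None if none exists."""
    for a, b in product(range(MASK + 1), repeat=2):
        for schedule in fault_schedules(num_ops):
            if program(a, b, schedule) != (a + b) & MASK:
                return (a, b, schedule)
    return None


print(verify(dmr_add, 2))          # no counterexample under the single-fault model
print(verify(unprotected_add, 1))  # a counterexample: inputs plus the fault that breaks it
```

The assertion holds for `dmr_add` only because the model permits at most one fault. Under a model that allowed the same bit to flip in both additions, the comparison would miss the error, which is why the execution model has to be part of the specification being verified.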
