Transactional Encoding for Tolerating Transient Hardware Errors

The decreasing feature size of integrated circuits leads to less reliable hardware with a higher likelihood of errors. Without additional failure detection and masking mechanisms, the next generations of CPUs would be unfit, at the very least, for executing mission- and safety-critical applications. One common approach is the replicated execution of programs on redundant cores, which is increasingly difficult because most programs are non-deterministic. To detect and mask execution errors, one typically needs to execute three copies of each thread. In this paper, we propose and evaluate transactional encoding, a novel approach to detecting and masking transient hardware errors so that one can build safe applications on top of unreliable components. Transactional encoding combines arithmetic codes, which detect transient hardware errors, with transactional memory, which provides recovery from and tolerance of those errors. We present a prototype software implementation that encodes applications using an LLVM-based compiler and executes them with a customized software transactional memory algorithm. Our evaluation shows that our system can successfully survive between 90% and 96% of transient hardware errors.
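The combination of arithmetic codes and transactional rollback can be illustrated with a minimal Python sketch. It is not the paper's LLVM-based prototype or its STM algorithm: the constant `A`, the helper names, the dictionary-based "transaction", and the injected bit flip are all assumptions made for illustration. It uses a plain AN code, in which every value is stored as a multiple of `A`, so a corrupted value is very likely not divisible by `A`; on detection the private working copy is discarded and the transaction is retried.

```python
A = 58659  # hypothetical odd code constant, chosen only for this sketch


class TransientError(Exception):
    """Raised when a value is not a valid AN code word."""


def encode(x):
    return x * A


def is_valid(v):
    return v % A == 0


def decode(v):
    if not is_valid(v):
        raise TransientError(f"invalid code word: {v}")
    return v // A


def run_transaction(state, body, max_retries=3):
    """Run body on a private copy of state; commit only if every value is valid.

    Returns the number of attempts that were needed.
    """
    for attempt in range(max_retries):
        working = dict(state)  # private copy plays the role of a write buffer
        try:
            body(working, attempt)
            for v in working.values():  # validate before commit
                if not is_valid(v):
                    raise TransientError("corrupted value detected")
        except TransientError:
            continue  # drop the working copy (rollback) and retry
        state.update(working)  # commit
        return attempt + 1
    raise RuntimeError("transaction failed after retries")


def faulty_add(working, attempt):
    # AN codes are preserved by addition: a*A + b*A == (a + b)*A.
    working["sum"] = working["a"] + working["b"]
    if attempt == 0:
        working["sum"] ^= 1 << 3  # simulated transient bit flip, first try only


state = {"a": encode(2), "b": encode(3), "sum": encode(0)}
attempts = run_transaction(state, faulty_add)
print(attempts, decode(state["sum"]))
```

In this sketch the first attempt is corrupted and rolled back, and the second attempt commits the correct sum. A real system would also need encoded versions of the other operations (multiplication, comparisons, memory accesses), which this example omits.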