Software Approaches for In-time Resilience

Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. However, due to unabated scaling, the unreliability of the underlying hardware is only getting worse. For many applications, merely recovering from errors is not enough: the latency from the occurrence of a fault to its detection and recovery, i.e., in-time error resilience, is of vital importance. This is especially true for real-time applications, where the timing of application events is a crucial part of the correctness of the application. Software techniques for resilience are highly desirable because they can be applied flexibly, but achieving reliable, in-time software resilience remains an elusive goal. A new class of recent techniques has started to tackle this problem. This paper presents a succinct overview of existing software resilience techniques from the point of view of in-time resilience and points out future challenges.
