Compensate or Ignore? Meeting Control Robustness Requirements through Adaptive Soft-Error Handling

To avoid catastrophic events such as unrecoverable system failures caused by soft errors on mobile and embedded systems, software-based error detection and compensation techniques have been proposed. Methods such as error-correction codes and redundant execution offer high flexibility and allow application-specific selection of fault tolerance without the need for special hardware support. However, such software-based approaches may overload the system because of their execution-time overhead. An adaptive deployment of these techniques that meets both application requirements and system constraints is therefore desirable. In our case study, we observe that a control task can tolerate a limited number of errors with acceptable performance loss. Such tolerance can be modeled as an (m,k) constraint, which requires at least m correct runs out of any k consecutive runs. In this paper, we discuss how a given (m,k) constraint can be satisfied by adopting patterns of task instances with individual error detection and compensation capabilities. We introduce static strategies and provide a formal feasibility analysis for validation. Furthermore, we develop an adaptive scheme that extends our initial approach with online awareness, which increases efficiency while preserving the analysis results. The effectiveness of our method is shown in a real-world case study as well as on synthesized task sets.
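
To make the (m,k) constraint concrete, the following minimal sketch checks a recorded sequence of run outcomes against it with a sliding window. It is an illustration only and not the feasibility analysis or the adaptive scheme of the paper; the function name `satisfies_mk` and the choice to check only complete windows of k runs are assumptions made for this example.

```python
from typing import Sequence


def satisfies_mk(outcomes: Sequence[bool], m: int, k: int) -> bool:
    """Check whether every window of k consecutive runs has >= m correct runs.

    outcomes[i] is True if run i produced a correct result.
    Assumption: only complete windows are checked, so a sequence
    shorter than k runs trivially satisfies the constraint.
    """
    if k <= 0 or not 0 <= m <= k:
        raise ValueError("require k > 0 and 0 <= m <= k")
    if len(outcomes) < k:
        return True

    # Number of correct runs in the first window.
    correct = sum(1 for ok in outcomes[:k] if ok)
    if correct < m:
        return False

    # Slide the window by one run at a time, updating the count.
    for i in range(k, len(outcomes)):
        correct += int(outcomes[i]) - int(outcomes[i - k])
        if correct < m:
            return False
    return True


if __name__ == "__main__":
    # One error in every three runs still meets a (2,3) constraint.
    print(satisfies_mk([True, False, True, True, False, True], m=2, k=3))
    # Two consecutive errors violate it.
    print(satisfies_mk([True, False, False, True], m=2, k=3))
```

The sliding count keeps the check linear in the number of runs, which is also how an online monitor could track the most recent k outcomes.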
