Towards improved survivability in safety-critical systems

The performance demand of Critical Real-Time Embedded (CRTE) systems that implement safety-related features is growing at an exponential rate. Only modern semiconductor technologies can satisfy these performance needs efficiently. However, those technologies suffer from high failure rates, which lowers the survivability of chips to levels that are unacceptable for CRTE systems. This paper presents the SESACS architecture (Surviving Errors in SAfety-Critical Systems), a paradigm shift in the design of CRTE systems. SESACS is a new system design methodology consisting of three main components: (i) a multicore hardware/firmware platform capable of detecting and diagnosing hardware faults of any type with minimal impact on the worst-case execution time (WCET), recovering quickly from errors, and reconfiguring the system so that it exhibits a predictable and analyzable degradation in WCET; (ii) a set of analysis methods and tools to prove the timing correctness of the reconfigured system; and (iii) a white-box methodology and tools to prove the functional safety of the system and its compliance with industry standards. This design paradigm will benefit the embedded systems industry for several decades by enabling the use of more cost-effective multicore hardware platforms built on modern semiconductor technologies, thereby enabling higher performance and reducing weight and power dissipation. It will also extend the lifetime of embedded systems, thereby reducing warranty and early-replacement costs.
