Transparent software replication and hardware monitoring leveraging modern System-on-Chip features

Modern Commercial-Off-The-Shelf (COTS) System-on-Chip (SoC) devices such as multicore computers offer a variety of built-in features, including Direct Memory Access (DMA) engines and sophisticated debug units. Using COTS devices in safety-critical environments such as avionics requires replication. The replicas can be based on diverse hardware, to mitigate faults such as design errors, or on similar hardware, to compensate for permanent and transient hardware faults, e.g., those caused by single-event effects. This paper presents a novel approach to building fault-tolerant board architectures that uses chip-built-in features such as debug units to replicate application software components without requiring any adaptation of the application software. The advantages of the presented approach are the ability (1) to build fault-tolerant architectures relatively cheaply out of COTS components and (2) to separate the functional program from fault-tolerance-related code and, hence, to include legacy code transparently. A demonstrator using two modern multicore processors connected by PCIe and debug units shows the feasibility of the conceptual approach. Additional performance measurements quantify the benefit over commonly deployed software-based approaches.
