Application and System-Level Software Fault Tolerance through Full System Restarts

Driven by growing performance requirements, embedded systems are becoming increasingly complex, yet they are still expected to be reliable. Guaranteeing reliability in complex systems is very challenging. Consequently, there is a substantial need for designs that can use unverified components, such as a real-time operating system (RTOS), without depending on their correctness to guarantee safety. In this work, we propose a novel approach to designing a controller that enables the system to restart and remain safe during and after the restart. Complementing this controller with a switching logic allows a complex, unverified controller to drive the system as long as it does not jeopardize safety. Such a design also tolerates faults that occur in the underlying software layers, such as the RTOS and middleware, and recovers from them through system-level restarts that reinitialize the software (middleware, RTOS, and applications) from read-only storage. Our approach can be implemented on a single commercial off-the-shelf (COTS) processing unit. To demonstrate the efficacy of our solution, we fully implement a controller for a 3-degree-of-freedom (3DOF) helicopter. We test the system by injecting various types of faults into the applications and the RTOS and verify that the system remains safe.
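The switching idea described above can be illustrated with a toy sketch. It is not the paper's controller or its helicopter model: the one-dimensional plant (x' = u), the safe interval, the restart duration, the assumption that the last actuator command is held while the system restarts, and all names (`Params`, `safety_controller`, `stays_safe_through_restart`, `decide`) are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    # All values are hypothetical and chosen only for this toy example.
    x_max: float = 1.0         # safe set is |x| <= x_max
    u_max: float = 0.5         # actuator limit, |u| <= u_max
    dt: float = 0.1            # control period in seconds
    restart_time: float = 0.4  # assumed worst-case full-restart duration


def safety_controller(x: float, p: Params) -> float:
    """Simple fallback: drive the state toward the origin, saturated."""
    u = -x / p.dt
    return max(-p.u_max, min(p.u_max, u))


def stays_safe_through_restart(x: float, u: float, p: Params) -> bool:
    """Check that applying u now, then holding it during a restart, is safe.

    Assumption: the plant is x' = u and the command is held while the
    software restarts, so the trajectory is a straight line and checking
    its two endpoints is enough.
    """
    x_end = x + u * (p.dt + p.restart_time)
    return abs(u) <= p.u_max and abs(x) <= p.x_max and abs(x_end) <= p.x_max


def decide(x: float, u_complex, p: Params):
    """Return (command, restart_requested).

    u_complex is the unverified controller's output, or None if it produced
    nothing (for example, because it crashed or missed its deadline).
    """
    if u_complex is not None and stays_safe_through_restart(x, u_complex, p):
        return u_complex, False
    # Unsafe or missing command: fall back and request a system restart.
    return safety_controller(x, p), True


if __name__ == "__main__":
    p = Params()
    print(decide(0.0, 0.2, p))   # complex command accepted
    print(decide(0.9, 0.5, p))   # complex command rejected, restart requested
```

In this sketch the unverified command is used only when the state provably stays inside the safe interval even if a restart begins immediately afterwards; otherwise the fallback command is applied and a restart is requested.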
