Special Session: Operating Systems under test: an overview of the significance of the operating system in the resiliency of the computing continuum

The computing continuum is currently seeing growth in the number of devices with some degree of computational capability. These devices may run a full software stack, including the Operating System layer and the Application layer, or rely on pure bare-metal solutions. In either case, the reliability of the full system stack has to be guaranteed. Designing highly resilient systems requires data on the impact of faults at every level of the system stack and on potential hardening solutions. While most work concentrates on application reliability, this special session aims to provide a deep understanding of how the reliability of an embedded system is affected when faults in the hardware substrate of the system stack surface at the Operating System layer. To this end, we will first compare, from an application perspective, the effect of hardware faults in bare-metal, real-time OS, and general-purpose OS settings. We will then go deeper into FreeRTOS to evaluate the contribution of each part of the OS. Finally, the Special Session will propose hardening techniques at the Operating System level that exploit its scheduling capabilities.
