SuperGlue: IDL-Based, System-Level Fault Tolerance for Embedded Systems

As processor feature sizes shrink, mitigating faults in low-level system services has become a critical aspect of dependable system design. In this paper, we introduce SuperGlue, an interface description language (IDL) and compiler for recovery from transient faults in a component-based operating system. SuperGlue generates recovery code that combines commodity hardware isolation, micro-rebooting, and interface-directed fault recovery to provide predictable and efficient recovery from faults that affect low-level system services. SuperGlue reduces the amount of recovery code that system designers must implement by an order of magnitude, replacing it with declarative specifications. We evaluate SuperGlue with a fault injection campaign targeting low-level system components (e.g., the memory mapping manager and the scheduler), and we measure its performance in a web-server application. Results show that SuperGlue improves system reliability with only a small performance degradation of 11.84%.
