Paper Information - SuperGlue: IDL-Based, System-Level Fault Tolerance for Embedded Systems

As processor feature sizes shrink, mitigating faults in low-level system services has become a critical aspect of dependable system design. In this paper, we introduce SuperGlue, an interface description language (IDL) and compiler for recovery from transient faults in a component-based operating system. SuperGlue generates code that combines commodity hardware isolation, micro-rebooting, and interface-directed fault recovery to provide predictable and efficient recovery from faults that impact low-level system services. SuperGlue reduces the amount of recovery code that system designers must implement by an order of magnitude, replacing it with declarative specifications. We evaluate SuperGlue with a fault injection campaign targeting low-level system components (e.g., the memory mapping manager and the scheduler). Additionally, we evaluate the performance of SuperGlue in a web-server application. Results show that SuperGlue improves system reliability at the cost of an 11.84% performance degradation.
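The combination of micro-rebooting and interface-directed recovery described in the abstract can be illustrated with a small Python sketch. This is not SuperGlue's IDL or its generated code: `LockServer`, `RecoveringStub`, `ComponentFault`, and the `inject_fault` flag are all hypothetical names invented for this illustration. The sketch assumes a client-side stub that tracks the state of each descriptor it hands out, restarts the failed component with fresh state, and replays just enough interface calls to rebuild that state before retrying the interrupted call.

```python
class ComponentFault(Exception):
    """Signals that the (hypothetical) server component has failed."""


class LockServer:
    """Toy server component. All of its state is lost on a micro-reboot."""

    def __init__(self):
        self.owners = {}          # lock id -> current owner (or None)
        self.next_id = 1
        self.inject_fault = False  # test hook: fail the next call once

    def _check(self):
        if self.inject_fault:
            self.inject_fault = False
            raise ComponentFault("transient fault")

    def alloc(self):
        self._check()
        lock_id = self.next_id
        self.next_id += 1
        self.owners[lock_id] = None
        return lock_id

    def take(self, lock_id, owner):
        self._check()
        self.owners[lock_id] = owner

    def owner_of(self, lock_id):
        self._check()
        return self.owners[lock_id]


class RecoveringStub:
    """Client-side stub: tracks descriptor state and replays it after a fault."""

    def __init__(self):
        self.server = LockServer()
        self.tracked = {}   # client descriptor -> {"server_id", "owner"}
        self.next_desc = 1
        self.reboots = 0

    def _recover(self):
        # Micro-reboot: replace the failed component with a fresh instance,
        # then rebuild each tracked descriptor through the normal interface.
        self.server = LockServer()
        self.reboots += 1
        for state in self.tracked.values():
            state["server_id"] = self.server.alloc()
            if state["owner"] is not None:
                self.server.take(state["server_id"], state["owner"])

    def _call(self, fn):
        # Invoke the server; on a fault, recover and retry the call once.
        try:
            return fn()
        except ComponentFault:
            self._recover()
            return fn()

    def alloc(self):
        server_id = self._call(lambda: self.server.alloc())
        desc = self.next_desc
        self.next_desc += 1
        self.tracked[desc] = {"server_id": server_id, "owner": None}
        return desc

    def take(self, desc, owner):
        state = self.tracked[desc]
        self._call(lambda: self.server.take(state["server_id"], owner))
        state["owner"] = owner

    def owner_of(self, desc):
        state = self.tracked[desc]
        return self._call(lambda: self.server.owner_of(state["server_id"]))
```

In this sketch the client only ever sees its own descriptors, so the stub is free to remap them to new server-side identifiers after a reboot. A generated stub in a real system would derive the tracked state and the replay sequence from the declarative interface specification rather than from hand-written code like this.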
