Failure Resilience for Device Drivers

Studies have shown that device drivers and extensions contain 3-7 times more bugs than other operating system code and are therefore more likely to fail. We present a failure-resilient operating system design that can recover from failures of drivers and other critical components, primarily by monitoring and replacing malfunctioning components on the fly, transparently to applications and without user intervention. This paper focuses on the post-mortem recovery procedure. We explain how our defect detection mechanism works, describe the policy-driven recovery procedure, and show how components are reintegrated after a restart. We also discuss the concrete steps taken to recover from failures in network, block device, and character device drivers. Finally, we evaluate our design using performance measurements, software fault-injection experiments, and an analysis of the reengineering effort.
