Fine-grained fault tolerance using device checkpoints

Recovering from faults in drivers is harder than recovering from faults in other code because driver state is spread across both memory and a device. Existing driver fault-tolerance mechanisms either restart the driver and discard its state, which can break applications, or require extensive logging to replay requests and recreate driver state. Even logging may be insufficient if the semantics of requests are ambiguous. In addition, these systems require either large subsystems that must be kept up to date as the kernel changes or substantial rewriting of drivers. We present Fine-Grained Fault Tolerance (FGFT), a new driver fault-tolerance mechanism that provides fine-grained control over which code is protected. FGFT isolates driver code at the granularity of a single entry point and executes that code as a transaction, allowing rollback if the driver fails. We develop a novel checkpointing mechanism that saves and restores device state using existing power-management code. Unlike past systems, FGFT can be deployed incrementally in a single driver without a large kernel subsystem, at the cost of small modifications to the driver. In our evaluation, we show that FGFT has almost zero runtime cost in many cases and that checkpoint-based recovery reduces the duration of a failure by 79% compared to restarting the driver. Finally, we show that applying FGFT to a driver requires little effort and that the majority of drivers in common classes already contain the power-management code needed for checkpoint/restore.
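The idea of running one entry point as a transaction over both memory and device state can be illustrated with a toy model. The sketch below is not FGFT's implementation and is not kernel code: `ToyDevice`, `ToyDriver`, `DriverFault`, `run_as_transaction`, and `set_mtu` are hypothetical names invented for illustration, and the use of a shadow copy of driver memory is an assumption about how such a transaction could be structured. It only shows the shape of the technique: checkpoint the device through suspend-style code, run the entry point on a copy of driver state, and either commit the copy or restore the checkpoint.

```python
import copy
from dataclasses import dataclass, field


class DriverFault(Exception):
    """Hypothetical stand-in for a fault detected while an entry point runs."""


@dataclass
class ToyDevice:
    # Hypothetical device: a handful of configuration registers.
    registers: dict = field(default_factory=lambda: {"mtu": 1500, "promisc": 0})

    def suspend(self):
        # Power-management style "save": snapshot the configuration state.
        return copy.deepcopy(self.registers)

    def resume(self, saved):
        # Power-management style "restore": reprogram the device from a snapshot.
        self.registers = copy.deepcopy(saved)


@dataclass
class ToyDriver:
    device: ToyDevice
    # In-memory driver state that mirrors part of the device configuration.
    state: dict = field(default_factory=lambda: {"mtu": 1500, "tx_packets": 0})


def run_as_transaction(driver, entry_point, *args):
    """Run one entry point; commit on success, roll back on DriverFault."""
    device_checkpoint = driver.device.suspend()  # checkpoint device state
    committed_state = driver.state
    driver.state = copy.deepcopy(committed_state)  # entry point sees a shadow copy
    try:
        result = entry_point(driver, *args)
    except DriverFault:
        driver.state = committed_state            # discard the shadow copy
        driver.device.resume(device_checkpoint)   # restore device state
        return False, None
    return True, result                           # shadow copy is now the state


def set_mtu(driver, mtu):
    # Hypothetical entry point: touches the device first, then may fail.
    driver.device.registers["mtu"] = mtu
    if mtu > 9000:
        raise DriverFault("unsupported MTU")
    driver.state["mtu"] = mtu
    return mtu


drv = ToyDriver(ToyDevice())
first = run_as_transaction(drv, set_mtu, 4000)    # succeeds and commits
second = run_as_transaction(drv, set_mtu, 65536)  # faults and rolls back
```

After the second call, both the in-memory state and the device registers still hold the MTU committed by the first call, so the caller sees a failed request rather than a restarted driver with lost state.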
