Membrane: Operating system support for restartable file systems

We introduce Membrane, a set of changes to the operating system to support restartable file systems. Membrane allows an operating system to tolerate a broad class of file system failures, and does so while remaining transparent to running applications; upon failure, the file system restarts, its state is restored, and pending application requests are serviced as if no failure had occurred. Membrane provides transparent recovery through a lightweight logging and checkpoint infrastructure, and includes novel techniques to improve performance and correctness of its fault-anticipation and recovery machinery. We tested Membrane with ext2, ext3, and VFAT. Through experimentation, we show that Membrane induces little performance overhead and can tolerate a wide range of file system crashes. More critically, Membrane does so with little or no change to existing file systems, thus improving robustness to crashes without mandating intrusive changes to existing file-system code.
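The recovery flow described above (checkpoint, log operations, restart on failure, restore state, then service the pending request) can be illustrated with a small user-level sketch. This is not Membrane's kernel implementation: the names `ToyFS`, `RestartableFS`, `checkpoint`, and `_recover` are hypothetical, the "file system" is an in-memory dictionary, and a fault is simulated by an exception that wipes the in-memory state.

```python
import copy


class ToyFS:
    """Hypothetical in-memory stand-in for a file system that can fault."""

    def __init__(self):
        self.files = {}
        self.fail_next = False  # set to True to simulate a fault on the next call

    def write(self, name, data):
        if self.fail_next:
            self.fail_next = False
            self.files.clear()  # simulate loss of in-memory state
            raise RuntimeError("simulated file system fault")
        self.files[name] = self.files.get(name, "") + data
        return len(data)


class RestartableFS:
    """Wrapper (an assumption for illustration) that hides faults from callers."""

    def __init__(self):
        self.fs = ToyFS()
        self.checkpoint_state = {}  # last known-good state
        self.op_log = []            # operations completed since the checkpoint
        self.restarts = 0

    def checkpoint(self):
        # Record a consistent snapshot and discard the log that led up to it.
        self.checkpoint_state = copy.deepcopy(self.fs.files)
        self.op_log.clear()

    def write(self, name, data):
        try:
            result = self.fs.write(name, data)
        except RuntimeError:
            self._recover()
            # Re-issue the request that was in flight when the fault hit,
            # so the caller never observes the failure.
            result = self.fs.write(name, data)
        self.op_log.append((name, data))
        return result

    def _recover(self):
        # Restart: fresh instance, restore the checkpoint, replay the log.
        self.restarts += 1
        self.fs = ToyFS()
        self.fs.files = copy.deepcopy(self.checkpoint_state)
        for name, data in self.op_log:
            self.fs.write(name, data)


if __name__ == "__main__":
    fs = RestartableFS()
    fs.write("a.txt", "hello ")
    fs.checkpoint()
    fs.write("a.txt", "world")
    fs.fs.fail_next = True          # inject a fault into the next request
    fs.write("b.txt", "data")       # completes despite the fault
    print(fs.fs.files, fs.restarts)
```

In this sketch the caller's `write` returns normally even though the underlying instance failed and was replaced, which is the transparency property the abstract claims. A real design would also have to handle on-disk state, open file descriptors, and concurrent requests, none of which this toy models.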
