Why panic()?: improving reliability with restartable file systems

The file system is one of the most critical components of the operating system. Almost all applications running in the operating system require file systems to be available for their proper operation. Though file-system availability is critical in many cases, very little work has been done on tolerating file system crashes. In this paper, we propose Membrane, a set of changes to the operating system to support restartable file systems. Membrane allows an operating system to tolerate a broad class of file system failures and does so while remaining transparent to running applications; upon failure, the file system restarts, its state is restored, and pending application requests are serviced as if no failure had occurred. Our initial evaluation ofMembrane with ext2 shows thatMembrane induces little performance overhead and can tolerate a wide range of file system crashes. More critically, Membrane does so with few changes to ext2, thus improving robustness to crashes without mandating intrusive changes to existing filesystem code.

[1]  Yuanyuan Zhou,et al.  Rx: treating bugs as allergies---a safe method to survive software failures , 2005, SOSP '05.

[2]  Jonathan S. Shapiro,et al.  EROS: A Principle-Driven Operating System from the Ground Up , 2002, IEEE Softw..

[3]  Robert B. Hagmann,et al.  Reimplementing the Cedar file system using logging and group commit , 1987, SOSP '87.

[4]  Andrea C. Arpaci-Dusseau,et al.  IRON file systems , 2005, SOSP '05.

[5]  Steve R. Kleiman,et al.  Extent-like Performance from a UNIX File System , 1991, USENIX Winter.

[6]  Tracy Kidder,et al.  Soul of a New Machine , 1981 .

[7]  George C. Necula,et al.  SafeDrive: safe and recoverable extensions using language-based techniques , 2006, OSDI '06.

[8]  Brian N. Bershad,et al.  Improving the reliability of commodity operating systems , 2003, SOSP '03.

[9]  Martin C. Rinard,et al.  Automatic detection and repair of errors in data structures , 2003, OOPSLA '03.

[10]  Emin Gün Sirer,et al.  Device Driver Safety Through a Reference Validation Mechanism , 2008, OSDI.

[11]  Stefan Götz,et al.  Unmodified Device Driver Reuse and Improved System Dependability via Virtual Machines , 2004, OSDI.

[12]  Brian N. Bershad,et al.  Improving the reliability of commodity operating systems , 2005, TOCS.

[13]  Andrea C. Arpaci-Dusseau,et al.  EIO: Error Handling is Occasionally Correct , 2008, FAST.

[14]  Andrew Warfield,et al.  Safe Hardware Access with the Xen Virtual Machine Monitor , 2007 .

[15]  Roy H. Campbell,et al.  CuriOS: Improving Reliability through Operating System Structure , 2008, OSDI.

[16]  Wei Hu,et al.  Scalability in the XFS File System , 1996, USENIX Annual Technical Conference.

[17]  Dawson R. Engler,et al.  Bugs as deviant behavior: a general approach to inferring errors in systems code , 2001, SOSP.

[18]  Junfeng Yang,et al.  Using model checking to find serious file system errors , 2004, TOCS.

[19]  Brian N. Bershad,et al.  Recovering device drivers , 2004, TOCS.