Optimistic crash consistency

We introduce optimistic crash consistency, a new approach to crash consistency in journaling file systems. Using an array of novel techniques, we demonstrate how to build an optimistic commit protocol that correctly recovers from crashes and delivers high performance. We implement this optimistic approach within a Linux ext4 variant which we call OptFS. We introduce two new file-system primitives, osync() and dsync(), that decouple ordering of writes from their durability. We show through experiments that OptFS improves performance for many workloads, sometimes by an order of magnitude; we confirm its correctness through a series of robustness tests, showing it recovers to a consistent state after crashes. Finally, we show that osync() and dsync() are useful in atomic file system and database update scenarios, both improving performance and meeting application-level consistency demands.
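To make the atomic-update scenario concrete, here is a minimal sketch of the familiar write-temp-then-rename pattern, marking the point where an ordering-only primitive could stand in for a full flush. It is an illustration under stated assumptions, not the OptFS interface: stock kernels and Python expose no osync() call, so the `osync` function below is a hypothetical stand-in that falls back to `os.fsync`, which enforces ordering and durability together. The helper name `atomic_update` is likewise an assumption made for the example.

```python
import os
import tempfile


def osync(fd: int) -> None:
    """Hypothetical stand-in for an ordering-only primitive.

    No osync() exists on stock kernels, so this sketch falls back to
    os.fsync(), which is stronger: it forces durability as well as ordering.
    """
    os.fsync(fd)


def atomic_update(path: str, data: bytes) -> None:
    """Replace the contents of `path` via a temporary file and a rename.

    The update is atomic only if the new data is ordered before the rename;
    that ordering point is where osync() is called.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()            # push Python's buffer to the kernel
            osync(tmp.fileno())    # order the data before the rename
        os.replace(tmp_path, path)  # atomically switch to the new contents
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

In this pattern the application needs the data write to precede the rename, but it does not necessarily need either to be durable right away. That is the gap the abstract describes when it says the two primitives decouple ordering from durability.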
