CrashMonkey and ACE

We present CrashMonkey and Ace, a set of tools to systematically find crash-consistency bugs in Linux file systems. CrashMonkey is a record-and-replay framework that tests a given workload on the target file system by simulating power-loss crashes while the workload is being executed and checking whether the file system recovers to a correct state after each crash. Ace automatically generates all the workloads to be run on the target file system. We build CrashMonkey and Ace on a new approach to testing file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. B3 builds upon insights derived from our study of crash-consistency bugs reported in Linux file systems in the last 5 years. We observed that most reported bugs can be reproduced using small workloads of three or fewer file-system operations on a newly created file system, and that all reported bugs result from crashes after fsync()-related system calls. CrashMonkey and Ace are able to find 24 out of the 26 crash-consistency bugs reported in the last 5 years. Our tools also revealed 10 new crash-consistency bugs in widely used, mature Linux file systems, 7 of which have existed in the kernel since 2014. Additionally, our tools found a crash-consistency bug in a verified file system, FSCQ. The new bugs have severe consequences such as broken rename atomicity, loss of persisted files and directories, and data loss.
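The bounding idea behind B3 can be illustrated with a minimal Python sketch. It is not Ace's actual generator: the operation names, the choice of persistence calls, and the simplification that every workload ends with exactly one persistence call are assumptions made here for illustration. The sketch fixes a small set of operations and a maximum sequence length, then enumerates every workload inside that bound.

```python
from itertools import product

# Hypothetical operation vocabulary (an assumption, not Ace's real list).
OPS = ("creat", "write", "rename", "link", "unlink")
# Hypothetical persistence calls placed at the end of each workload.
PERSIST = ("fsync", "sync")


def bounded_workloads(ops, persist_ops, max_len):
    """Yield every workload of 1..max_len operations drawn from `ops`,
    each followed by one persistence call from `persist_ops`."""
    for length in range(1, max_len + 1):
        for seq in product(ops, repeat=length):
            for persist in persist_ops:
                yield seq + (persist,)


# Bound the space to at most three operations, as in the study's observation.
workloads = list(bounded_workloads(OPS, PERSIST, 3))
print(len(workloads))
print(workloads[0])
```

With five operations, two persistence calls, and a bound of three operations, the sketch enumerates 2 × (5 + 25 + 125) = 310 workloads. Each one would then be handed to a crash-testing harness that simulates a crash after the persistence call and checks the recovered state.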
