WAFL Iron: Repairing Live Enterprise File Systems

Consistent and timely access to an arbitrarily damaged file system is an important requirement of enterprise-class systems. Repairing file system inconsistencies is simplest when file system access is limited to the repair tool. Checking and repairing a file system while it is open for general access presents unique challenges. In this paper, we explore these challenges, present our online repair tool for the NetApp® WAFL® file system, and show how it achieves the same results as offline repair even while client access is enabled. We also present implementation details and evaluate the tool's performance. To the best of our knowledge, this publication is the first to describe a fully functional online repair tool.
