On Fault Resilience of File System Checkers

File system checkers serve as the last line of defense for recovering a corrupted file system to a consistent state. In this position paper, we study the behavior of file system checkers under emulated faults. We answer two important questions: instead of fixing the original corruption, will an interrupted checker cause more severe damage? If so, can the additional damage be fixed by the existing checker? Our preliminary results show that popular file system checkers have vulnerabilities that could lead to unrecoverable data loss under faults.

1 Motivation

Despite various protection techniques [19, 12, 10, 18, 6], file systems may still become corrupted for many reasons, including power outages, system crashes, hardware failures, and software bugs [15, 16, 9, 5, 14, 20, 17]. Thus, most file systems come with a checker that serves as the last line of defense for recovering a corrupted file system to a healthy state [15, 2, 11, 1, 3]. Given this importance, considerable work has been done to improve file system checkers in terms of both performance and reliability [15, 13, 8].

Complementary to the existing efforts, in this paper we study the behavior of file system checkers under faults. This is motivated by a recent incident at the High Performance Computing Center (HPCC) in Texas [7, 4], where the Lustre file system suffered severe data loss after two consecutive power outages: the first triggered the Lustre checker (i.e., LFSCK [2]) after the cluster was restarted, while the second interrupted LFSCK and led to the final downtime. Since Lustre is built on top of a variant of Ext4 and LFSCK relies on the local file system checker, the overall checking and recovery procedure is complicated. As one step toward pinpointing the vulnerabilities and building robust file system checkers, we perform a comprehensive study on the fault resilience of e2fsck [1], the default checker for the widely used Ext2/Ext3/Ext4 file systems.
Corruption Types           Percentage
unmountable                0.57%
file content corruption    2.85%
misplacement of files      6.28%
others                     0.57%

Table 1: Four types of unrecoverable corruption incurred by an interrupted e2fsck. The percentage is defined as the number of occurrences divided by the total number of test images.

[1] Yale N. Patt, et al. Soft updates: a solution to the metadata update problem in file systems, 2000.

[2] Margo I. Seltzer, et al. Unifying File System Protection, 2001, USENIX Annual Technical Conference, General Track.

[3] Andrea C. Arpaci-Dusseau, et al. An analysis of data corruption in the storage stack, 2008, TOS.

[4] Andrea C. Arpaci-Dusseau, et al. SQCK: A Declarative File System Checker, 2008, OSDI.

[5] Stephen C. Tweedie, et al. Journaling the Linux ext2fs Filesystem, 2008.

[6] Andrea C. Arpaci-Dusseau, et al. Tolerating File-System Mistakes with EnvyFS, 2009, USENIX Annual Technical Conference.

[7] George Candea, et al. Scalable testing of file system checkers, 2012, EuroSys '12.

[8] Andrea C. Arpaci-Dusseau, et al. Consistency without ordering, 2012, FAST.

[9] Andrea C. Arpaci-Dusseau, et al. Ffsck: the fast file system checker, 2013, FAST.

[10] Andrea C. Arpaci-Dusseau, et al. A Study of Linux File System Evolution, 2013, FAST.

[11] Mark Lillibridge, et al. Torturing Databases for Fun and Profit, 2014, OSDI.

[12] Andrea C. Arpaci-Dusseau, et al. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, 2014, OSDI.

[13] Changwoo Min, et al. Cross-checking semantic correctness: the case of finding file system bugs, 2015, SOSP.

[14] Adam Chlipala, et al. Using Crash Hoare logic for certifying the FSCQ file system, 2015, USENIX Annual Technical Conference.

[15] Yong Chen, et al. A Generic Framework for Testing Parallel File Systems, 2016, 2016 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS).