A Generic Framework for Testing Parallel File Systems

Large-scale parallel file systems are of prime importance today. Despite this importance, their failure-recovery capability has been studied far less than that of local storage systems. Recent studies of local storage systems have exposed various vulnerabilities that could lead to data loss under failure events, which raises concerns for the parallel file systems built on top of them. This paper proposes a generic framework for testing the failure handling of large-scale parallel file systems. The framework captures all disk I/O commands on all storage nodes of the target system to emulate realistic failure states, and checks whether the target system can recover to a consistent state without incurring data loss. We have built a prototype for the Lustre file system. Our preliminary results show that the framework is able to uncover the internal I/O behavior of Lustre under different workloads and failure conditions, which provides a solid foundation for further analysis of the failure recovery of parallel file systems.
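As a rough illustration of emulating failure states from captured disk I/O, the following minimal Python sketch replays prefixes of a recorded write trace to build per-node disk images. It is not the prototype's implementation: the `WriteCommand` record, the node labels, the single globally ordered trace, and the dictionary-based disk image are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class WriteCommand:
    """One captured disk write (hypothetical record layout)."""
    node: str    # label of the storage node that issued the write
    block: int   # block address on that node's disk
    data: bytes  # payload written to the block


def crash_state(trace: List[WriteCommand], cut: int) -> Dict[str, Dict[int, bytes]]:
    """Replay the first `cut` commands and return the per-node disk images.

    This models a failure that happens right after command `cut - 1`
    reached the disk; later commands are lost.
    """
    images: Dict[str, Dict[int, bytes]] = {}
    for cmd in trace[:cut]:
        images.setdefault(cmd.node, {})[cmd.block] = cmd.data
    return images


def all_crash_states(trace: List[WriteCommand]) -> List[Dict[str, Dict[int, bytes]]]:
    """Enumerate one emulated failure state per possible cut point."""
    return [crash_state(trace, cut) for cut in range(len(trace) + 1)]


# Illustrative trace; the node names and payloads are made up.
trace = [
    WriteCommand("mds", 0, b"inode"),
    WriteCommand("oss1", 7, b"data-a"),
    WriteCommand("mds", 0, b"inode-v2"),
]
states = all_crash_states(trace)
```

Each emulated state would then be handed to the target system's recovery procedure and checked for consistency and data loss; that checking step is specific to the target system and is left out of the sketch.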
