PFault: A General Framework for Analyzing the Reliability of High-Performance Parallel File Systems

High-performance parallel file systems (PFSes) are of prime importance today. Despite their importance, however, their reliability is much less studied than that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFault, a general framework for analyzing the failure handling of PFSes. PFault automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables systematically analyzing the recoverability of the PFS under faults. To demonstrate its practicality, we apply PFault to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repair utility, LFSCK, fails with unexpected symptoms (e.g., I/O errors, hangs, reboots). Moreover, with the help of PFault, we identify a resource leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvements in failure handling compared to a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.
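To make the emulation idea concrete, the following is a minimal, hypothetical sketch of fault-model-driven emulation in the spirit of the approach the abstract describes; it is not PFault's actual implementation. It assumes each storage target of the PFS is backed by a raw image file (e.g., a loopback device image) and illustrates two representative fault models: whole-device failure and silent block corruption. All names here (`emulate_whole_device_failure`, `emulate_corruption`, `/tmp/ost0.img`) are assumptions for illustration only.

```python
import os
import random

# Hypothetical sketch (not PFault's real code): each PFS storage target is
# assumed to be backed by a raw image file, so a device-level fault can be
# emulated by manipulating that file directly.

BLOCK_SIZE = 4096

def emulate_whole_device_failure(image_path: str) -> None:
    """Emulate a whole-device failure by making the backing image
    disappear, so subsequent PFS I/O to this device fails."""
    os.rename(image_path, image_path + ".failed")

def emulate_corruption(image_path: str, num_blocks: int = 16, seed: int = 42) -> None:
    """Emulate silent corruption by overwriting randomly chosen blocks
    with random bytes, without updating any on-disk metadata."""
    rng = random.Random(seed)  # seeded for reproducible fault injection
    total_blocks = os.path.getsize(image_path) // BLOCK_SIZE
    victims = rng.sample(range(total_blocks), min(num_blocks, total_blocks))
    with open(image_path, "r+b") as dev:
        for blk in victims:
            dev.seek(blk * BLOCK_SIZE)
            dev.write(rng.randbytes(BLOCK_SIZE))  # requires Python 3.9+

if __name__ == "__main__":
    # Example: corrupt a few blocks of a (hypothetical) OST image.
    emulate_corruption("/tmp/ost0.img", num_blocks=8)
```

After injecting a fault this way, one would restart the affected storage target and run the PFS's checking utility (e.g., LFSCK for Lustre) to observe whether the damage is detected and repaired, which is the kind of recoverability analysis the abstract describes.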
