Understanding and coping with failures in large-scale storage systems

Reliability in very large-scale storage systems has become increasingly important as the demand for storage has grown dramatically. New reliability phenomena appear as systems scale up; in such systems, failures are the norm. To ensure high reliability for petabyte-scale storage systems in scientific applications, this thesis characterizes failures and studies techniques for coping with them. The thesis first describes the architecture of a petabyte-scale storage system and characterizes the challenges of achieving high reliability in such a system. Long disk recovery times and the large number of system components are identified as the main obstacles to high system reliability. The thesis then presents a fast recovery mechanism, FARM, which greatly reduces data loss when multiple disks fail. The reliability of a petabyte-scale system with and without FARM is evaluated, and various aspects of system reliability, such as failure detection latency, bandwidth utilization for recovery, disk space utilization, and system scale, are examined by simulation. Overall system reliability is modeled and estimated by quantitative analysis based on Markov models and event-driven simulations. Disk failure models that take infant mortality into account are found to yield more precise reliability estimates than the traditional model, which assumes a constant failure rate, because infant mortality has a pronounced impact on petabyte-scale systems. To safeguard data against failures of young disk drives, an adaptive data redundancy scheme is presented and evaluated. A petabyte-scale storage system is typically built from thousands of components in a complicated interconnect structure. The impact of various failures on the interconnection network is gauged, and performance and robustness under degraded modes are evaluated in a simulated petabyte-scale storage system with different network topology configurations. This thesis is directed towards understanding and coping with failures in petabyte-scale storage systems. It addresses several emerging reliability challenges posed by the increasing scale of storage systems and studies methods for improving system reliability. The research is intended to help system architects design reliable storage systems at petabyte scale and beyond.
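To illustrate why long disk recovery time is singled out as an obstacle, the following is a minimal sketch of a textbook three-state Markov model for a single two-way mirrored pair. It is not the model used in the thesis: the function name `mttdl_mirrored_pair`, the MTTF value, and the recovery times are all hypothetical, and the model assumes the constant failure rate that the abstract describes as the less precise, traditional assumption.

```python
def mttdl_mirrored_pair(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time to data loss (hours) for one two-way mirrored pair.

    Three-state Markov chain (assumed for illustration only):
      both disks healthy --2*lam--> one disk failed --lam--> data lost
      one disk failed    --mu-----> both disks healthy (recovery done)
    Solving the expected absorption time from the healthy state gives
      MTTDL = (3*lam + mu) / (2*lam**2).
    """
    lam = 1.0 / mttf_hours  # constant per-disk failure rate (assumption)
    mu = 1.0 / mttr_hours   # recovery rate, the inverse of recovery time
    return (3.0 * lam + mu) / (2.0 * lam ** 2)


if __name__ == "__main__":
    mttf = 100_000.0  # hypothetical per-disk MTTF in hours
    for mttr in (24.0, 6.0, 1.0):  # hypothetical recovery times in hours
        years = mttdl_mirrored_pair(mttf, mttr) / (24 * 365)
        print(f"recovery {mttr:4.0f} h -> MTTDL about {years:,.0f} years per pair")
```

Because the recovery rate dominates the numerator when recovery is much faster than failure, the mean time to data loss in this simplified model grows roughly in inverse proportion to the recovery time. The sketch does not capture infant mortality, correlated failures, or the many redundancy groups of a petabyte-scale system.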
