Failure Recovery Issues in Large-Scale, Heavily Utilized Disk Storage Systems

Large datasets are increasingly important to large-scale computation and data analysis. Storage systems with petabytes of disk capacity are already common in high-performance computing and internet services, and their capacities are expected to grow at 40-100% per year. At these sizes and growth rates, traditional single-failure-tolerant (RAID 5) hardware controllers become increasingly inadequate. Two techniques are being employed to cope with weakening data reliability in these large-scale storage systems: stronger protection codes and parallel reconstruction based on parity declustering. The first tolerates more concurrent failures without data loss, at the cost of increased redundancy overhead. The second replaces traditional per-subsystem hardware RAID reconstruction with parallel, distributed reconstruction spread over all disks and RAID controllers. This paper explores the differences and similarities between large-scale storage systems in high-performance computing (HPC) and data-intensive scalable computing (DISC) for internet services, and revises reliability models for these storage systems to incorporate stronger redundant encodings and parallel reconstruction. A modern example, for systems of 1-5 petabytes, suggests that triplication can achieve as much as 10 times lower rates of data loss per year, even though it requires almost 3 times as many components for the same amount of user data; however, the absolute difference may be as small as 1 to 10 bytes lost per year. Many might decide that this factor of ten is not significant in light of other sources of information loss.
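To make the scale of these tradeoffs concrete, the back-of-the-envelope Python sketch below applies the standard Markov-chain MTTDL approximation to three designs of the kind the abstract contrasts: controller-confined RAID 5, a double-failure-tolerant code with declustered rebuild, and 3-way replication with declustered rebuild. All parameter values (disk capacity, MTTF, rebuild bandwidth, helper counts, and the one-disk-per-event severity assumption) are illustrative assumptions for this sketch, not numbers from the paper.

    HOURS_PER_YEAR = 8766

    def loss_events_per_year(n_disks, group_size, faults_tolerated, mttf_h, mttr_h):
        # Standard Markov-chain approximation, valid when MTTR << MTTF:
        # a group loses data when faults_tolerated + 1 failures overlap
        # within one repair window.
        lam = 1.0 / mttf_h
        rate = group_size * lam                      # first failure in a group
        for k in range(1, faults_tolerated + 1):
            rate *= (group_size - k) * lam * mttr_h  # each additional overlapping failure
        n_groups = n_disks / group_size
        return rate * n_groups * HOURS_PER_YEAR

    def declustered_mttr_h(disk_bytes, helper_bw_bytes_s, n_helpers):
        # Parity declustering spreads one disk's reconstruction across many
        # surviving disks, so repair time shrinks with the helper count.
        return disk_bytes / (helper_bw_bytes_s * n_helpers) / 3600.0

    # Illustrative parameters (assumptions, not values from the paper).
    DISK = 1e12        # 1 TB per disk
    MTTF = 1.2e6       # datasheet MTTF in hours
    BW   = 10e6        # 10 MB/s of rebuild bandwidth contributed per helper
    USER = 1e15        # 1 PB of user data

    configs = [
        # (name, group size, data disks per group, faults tolerated, rebuild helpers)
        ("RAID 5, per-controller rebuild", 10, 9, 1, 9),
        ("RAID 6, declustered rebuild",    10, 8, 2, 100),
        ("3-way replication, declustered",  3, 1, 2, 100),
    ]

    for name, g, data, m, helpers in configs:
        disks  = USER / DISK * g / data   # components needed for 1 PB of user data
        mttr   = declustered_mttr_h(DISK, BW, helpers)
        events = loss_events_per_year(disks, g, m, MTTF, mttr)
        # Crude severity model: assume each event loses one disk's worth of data.
        print(f"{name:32s} disks={disks:7.0f}  events/yr={events:.2e}  "
              f"bytes lost/yr={events * DISK:.2e}")

Under these assumed parameters, the two double-failure-tolerant designs both land in the bytes-lost-per-year range, with replication roughly an order of magnitude lower despite needing nearly 3 times the components, while the single-failure-tolerant baseline loses many orders of magnitude more; this is consistent with the abstract's qualitative claims, though the exact figures depend entirely on the assumed inputs.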
