Failure Recovery Issues in Large Scale, Heavily Utilized Disk Storage Systems

Large data is increasingly important to large-scale computation and data analysis. Storage systems with petabytes of disk capacity are not uncommon in high-performance computing and internet services today and are expected to grow at 40-100% per year. These sizes and rates of growth render traditional, single-failure-tolerant (RAID 5) hardware controllers increasingly inappropriate. Stronger protection codes and parallel reconstruction based on parity declustering are techniques being employed to cope with weakening data reliability in these large-scale storage systems. The first tolerates more concurrent failures without data loss at the cost of increasing redundancy overhead. The second parallelizes failure recovery from the traditional per-subsystem hardware RAID reconstruction to parallel and distributed reconstruction over all disks and RAID controllers. This paper explores the differences and similarities between large-scale storage systems in high-performance computing (HPC) and data-intensive scalable computing (DISC) for internet services, and revises reliability models for these storage systems to incorporate stronger redundant encoding and the use of parallel reconstruction. A modern example, for systems of 1-5 petabytes, suggests that triplication can have as much as 10 times lower rates of lost data per year, even when its number of components has to be almost 3 times more for the same amount of user data, but that this difference may be as small as 1 to 10 bytes lost per year. Many might decide that this factor of ten is not significant in light of other sources of information loss.

Milo Polte | Garth Gibson | Paul Nowoczynski | Lin Xiao