RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from six disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures. With these findings we designed RAIDS hield , which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88p of triple disk errors, which are 80p of all RAID failures. Second, we have designed and simulated a method of using the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures, which can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98p of triple errors. We conclude with discussions of operational considerations in deploying RAIDS hield more broadly and new directions in the analysis of disk errors. One interesting approach is to combine multiple metrics, allowing the values of different indicators to be used for predictions. Using newer field data that reports an additional metric, medium errors, we find that the relative efficacy of reallocated sectors and medium errors varies across disk models, offering an additional way to predict failures.

[1]  Shankar Pasupathy,et al.  An analysis of latent sector errors in disk drives , 2007, SIGMETRICS '07.

[2]  Joseph F. Murray,et al.  Improved disk-drive failure warnings , 2002, IEEE Trans. Reliab..

[3]  Andrea C. Arpaci-Dusseau,et al.  An analysis of data corruption in the storage stack , 2008, TOS.

[4]  Joseph F. Murray,et al.  Hard drive failure prediction using non-parametric statistical methods , 2003 .

[5]  Ajay Dholakia,et al.  A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors , 2006, TOS.

[6]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[7]  Fabrizio Lombardi,et al.  Detection of defective media in disks , 1993, Proceedings of 1993 IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems.

[8]  James Lee Hafner,et al.  WEAVER codes: highly fault tolerant erasure codes for storage systems , 2005, FAST'05.

[9]  Andrea C. Arpaci-Dusseau,et al.  Parity Lost and Parity Regained , 2008, FAST.

[10]  James Lee Hafner,et al.  Matrix methods for lost data reconstruction in erasure codes , 2005, FAST'05.

[11]  Jon G. Elerath Hard Disk Drives: The Good, the Bad and the Ugly! , 2007, ACM Queue.

[12]  Randy H. Katz,et al.  A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.

[13]  Michal Kaczmarczyk,et al.  HYDRAstor: A Scalable Secondary Storage , 2009, FAST.

[14]  Moisés Goldszmidt Finding Soon-to-Fail Disks in a Haystack , 2012, HotStorage.

[15]  Ahmed Amer,et al.  Reliability challenges for storing exabytes , 2014, 2014 International Conference on Computing, Networking and Communications (ICNC).

[16]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.

[17]  Jiwu Shu,et al.  GRID codes: Strip-based erasure codes with high fault tolerance for storage systems , 2009, TOS.

[18]  James S. Plank,et al.  Mean Time to Meaningless: MTTDL, Markov Models, and Storage System Reliability , 2010, HotStorage.

[19]  David A. Patterson,et al.  An Analysis of Error Behaviour in a Large Storage System , 1999 .

[20]  Fred Douglis,et al.  Characteristics of backup workloads in production systems , 2012, FAST.

[21]  Mario Blaum,et al.  Sector-Disk (SD) Erasure Codes for Mixed Failure Modes in RAID Systems , 2014, TOS.

[22]  Bruce Allen,et al.  Monitoring hard disks with smart , 2004 .

[23]  Hannu H. Kari Latent Sector Faults and Reliability of Disk Arrays , 2005 .

[24]  Mario Blaum,et al.  SD codes: erasure codes designed for how storage systems really fail , 2013, FAST.

[25]  Eduardo Pinheiro,et al.  Failure Trends in a Large Disk Drive Population , 2007, FAST.

[26]  Erik Riedel,et al.  More Than an Interface - SCSI vs. ATA , 2003, FAST.

[27]  Spencer W. Ng,et al.  Disk scrubbing in large archival storage systems , 2004, The IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004. (MASCOTS 2004). Proceedings..

[28]  GhemawatSanjay,et al.  The Google file system , 2003 .

[29]  Lisa Spainhower,et al.  Commercial fault tolerance: a tale of two systems , 2004, IEEE Transactions on Dependable and Secure Computing.

[30]  Joseph F. Murray,et al.  Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application , 2005, J. Mach. Learn. Res..

[31]  Cheng Huang,et al.  Erasure Coding in Windows Azure Storage , 2012, USENIX Annual Technical Conference.

[32]  Peter F. Corbett,et al.  Awarded Best Paper! -- Row-Diagonal Parity for Double Disk Failure Correction , 2004 .

[33]  Steve R. Kleiman,et al.  SnapMirror: File-System-Based Asynchronous Mirroring for Disaster Recovery , 2002, FAST.

[34]  Garth A. Gibson Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis , 1990 .

[35]  Greg Hamerly,et al.  Bayesian approaches to failure prediction for disk drives , 2001, ICML.

[36]  H KatzRandy,et al.  A case for redundant arrays of inexpensive disks (RAID) , 1988 .

[37]  Andrea C. Arpaci-Dusseau,et al.  IRON file systems , 2005, SOSP '05.

[38]  Andrea C. Arpaci-Dusseau,et al.  Fail-stutter fault tolerance , 2001, Proceedings Eighth Workshop on Hot Topics in Operating Systems.

[39]  G. A. Alvarez,et al.  Tolerating Multiple Failures In Raid Architectures With Optimal Storage And Uniform Declustering , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[40]  Michael Dahlin,et al.  TAPER: tiered approach for eliminating redundancy in replica synchronization , 2005, FAST'05.

[41]  Ahmed Amer,et al.  Improving Disk Array Reliability Through Expedited Scrubbing , 2010, 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage.

[42]  Peter F. Corbett,et al.  RAID triple parity , 2012, OPSR.

[43]  Garth A. Gibson,et al.  RAID: high-performance, reliable secondary storage , 1994, CSUR.

[44]  Armando Fox,et al.  Fingerprinting the datacenter: automated classification of performance crises , 2010, EuroSys '10.

[45]  Cheng Huang,et al.  Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads , 2012, FAST.

[46]  Jiri Schindler,et al.  Beyond MTTDL: A Closed-Form RAID 6 Reliability Equation , 2014, TOS.

[47]  C. Walter Kryder's law. , 2005, Scientific American.