Enhanced Reliability Modeling of RAID Storage Systems

A flexible model for estimating the reliability of RAID storage systems is presented. This model corrects errors associated with the common assumption that system times to failure follow a homogeneous Poisson process. Separate generalized failure distributions are used to model catastrophic failures and usage-dependent data corruptions for each hard drive. Catastrophic failure restoration is represented by a three-parameter Weibull distribution, so the model can include a minimum time to restore as a function of data transfer rate and hard drive storage capacity. Data can be scrubbed as a background operation to eliminate corrupted data that, in the event of a simultaneous catastrophic failure, would result in a double disk failure. Field-based time-to-failure data and a mathematical justification for a new model are presented. Model results have been verified and predict between 2 and 1,500 times as many double disk failures as are estimated using the current mean time to data loss method.
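As a rough illustration of the restoration model described above, the following minimal sketch draws restore times from a three-parameter Weibull distribution whose location parameter is the minimum time to restore. It assumes that the minimum is simply the drive capacity divided by the sustained transfer rate. The function names and all numeric values (capacity, rate, scale, shape) are hypothetical and are not taken from the paper.

```python
import random


def min_restore_hours(capacity_gb, rate_mb_per_s):
    """Assumed lower bound on restore time: one full pass over the drive."""
    seconds = capacity_gb * 1000.0 / rate_mb_per_s  # using 1 GB = 1000 MB
    return seconds / 3600.0


def sample_restore_hours(capacity_gb, rate_mb_per_s, scale_h, shape, rng):
    """Draw one restore time from a three-parameter Weibull distribution.

    The location parameter is the minimum restore time; scale_h (hours)
    and shape are the usual Weibull parameters (hypothetical values here).
    """
    location = min_restore_hours(capacity_gb, rate_mb_per_s)
    # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
    return location + rng.weibullvariate(scale_h, shape)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same draws
    draws = [sample_restore_hours(500, 50, 12.0, 2.0, rng) for _ in range(5)]
    print("minimum restore time (h):", round(min_restore_hours(500, 50), 2))
    print("sampled restore times (h):", [round(d, 2) for d in draws])
```

Because the location parameter shifts the whole distribution, no sampled restore time can fall below the transfer-limited minimum, which a two-parameter Weibull or an exponential restore time cannot guarantee.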
