Evaluating the Impact of Irrecoverable Read Errors on Disk Array Reliability

We investigate the impact of irrecoverable read errors, also known as bad blocks, on the mean time to data loss (MTTDL) of mirrored disks, RAID level 5 arrays, and RAID level 6 arrays. Our study is based on the data collected by Bairavasundaram et al. from a population of 1.53 million disks over a period of 32 months. It indicates that irrecoverable read errors can reduce the MTTDL of the three arrays by up to 99 percent, effectively canceling most of the benefits of fast disk repairs. It also shows the benefits of frequent scrubbing scans, which map out bad blocks and thus prevent future irrecoverable read errors. For example, once-a-month scrubbing scans were found to improve the MTTDL of the three arrays by at least 300 percent compared to once-a-year scrubbing scans.
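To give a feel for why even a small probability of an irrecoverable read error can erase most of the benefit of fast repairs, the sketch below evaluates a textbook-style Markov model of a two-disk mirror. It is an illustrative assumption, not the model used in the study: disks fail at a constant rate, repairs complete at a constant rate, and when the first disk fails the rebuild hits a bad block on the surviving disk with a fixed probability, which is treated as immediate data loss. The function name, the parameter names, and all numeric values are hypothetical.

```python
def mirror_mttdl(failure_rate, repair_rate, p_read_error=0.0):
    """MTTDL of a two-disk mirror under a simple Markov model (assumed).

    failure_rate  -- per-disk failure rate (1 / MTTF), per hour
    repair_rate   -- repair rate (1 / mean repair time), per hour
    p_read_error  -- assumed probability that rebuilding from the
                     surviving disk hits an irrecoverable read error
    """
    lam, mu, p = failure_rate, repair_rate, p_read_error
    if lam <= 0 or mu <= 0:
        raise ValueError("rates must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p_read_error must be in [0, 1]")
    # Solving the two expected-time equations
    #   T2 = 1/(2*lam) + (1 - p) * T1
    #   T1 = 1/(lam + mu) + mu/(lam + mu) * T2
    # for T2 (time to data loss starting with both disks healthy) gives:
    return (3 * lam + mu - 2 * lam * p) / (2 * lam * (lam + p * mu))


# Hypothetical parameters, chosen only for illustration.
LAM = 1.0 / 100_000   # disk MTTF of 100,000 hours
MU = 1.0 / 24         # mean repair time of 24 hours

baseline = mirror_mttdl(LAM, MU)                      # no read errors
degraded = mirror_mttdl(LAM, MU, p_read_error=0.01)   # 1% chance per rebuild
reduction = 1.0 - degraded / baseline

print(f"MTTDL without read errors: {baseline:.3e} hours")
print(f"MTTDL with read errors:    {degraded:.3e} hours")
print(f"Relative reduction:        {reduction:.1%}")
```

With the read-error probability set to zero, the expression reduces to the familiar (3λ + μ) / (2λ²) for a mirrored pair. Because the repair rate is typically orders of magnitude larger than the failure rate, the p·μ term in the denominator dominates as soon as p is non-negligible, which is the mechanism behind the large MTTDL reductions described above.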

[1] Spencer W. Ng, et al. Disk scrubbing in large archival storage systems, 2004, The IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS 2004).

[2] Randy H. Katz, et al. How reliable is a RAID?, 1989, Digest of Papers. COMPCON Spring 89. Thirty-Fourth IEEE Computer Society International Conference: Intellectual Leverage.

[3] Darrell D. E. Long, et al. Protecting against rare event failures in archival systems, 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[4] Bianca Schroeder, et al. Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?, 2007, FAST.

[5] Garth A. Gibson. Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis, 1990.

[6] Garth A. Gibson, et al. RAID: high-performance, reliable secondary storage, 1994, CSUR.

[7] Jon G. Elerath. A simple equation for estimating reliability of an N+1 redundant array of independent disks (RAID), 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[8] Walter A. Burkhard, et al. Disk array storage system reliability, 1993, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[9] Eduardo Pinheiro, et al. Failure Trends in a Large Disk Drive Population, 2007, FAST.

[10] Jon G. Elerath. Hard Disk Drives: The Good, the Bad and the Ugly!, 2007, ACM Queue.

[11] Randy H. Katz, et al. A case for redundant arrays of inexpensive disks (RAID), 1988, SIGMOD '88.

[12] Mary Baker, et al. A fresh look at the reliability of long-term digital storage, 2005, EuroSys.

[13] Shankar Pasupathy, et al. An analysis of latent sector errors in disk drives, 2007, SIGMETRICS '07.

[14] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.

[15] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2010, IEEE Trans. Dependable Secur. Comput.

[16] Ahmed Amer, et al. When MTTDLs Are Not Good Enough: Providing Better Estimates of Disk Array Reliability, 2008.

[17] Walter A. Burkhard, et al. RAID organization and performance, 1992, Proceedings of the 12th International Conference on Distributed Computing Systems.