Reliability Analysis of Declustered-Parity RAID 6 with Disk Scrubbing and Considering Irrecoverable Read Errors

We investigate the impact of Irrecoverable Read Errors (IREs) on the Mean Time To Data Loss (MTTDL) of declustered-parity RAID 6 systems. By extending the analytic model that Wu et al. used to study the reliability of RAID 5 systems, we obtain an MTTDL that accounts for two main types of data loss: data loss caused by three independent disk failures, and data loss caused by an IRE detected during the rebuild after two disks have failed. We further refine the analysis by considering disk scrubbing, which reduces the probability of IREs by periodically reading the data stored on each disk. The results of our numerical analysis show that IREs have a large effect on the MTTDL, and that the countermeasure is to increase the disk scrubbing rate. For example, the MTTDL of a system in which each disk is scrubbed every day is at least 27 times that of a system in which each disk is scrubbed once a year. In addition, declustered-parity RAID 6 systems are more reliable than standard non-declustered RAID 6 systems. For example, the MTTDL of a declustered-parity RAID 6 system without disk scrubbing is at least 150 times that of a standard system in which each disk is scrubbed every day.
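To give a feel for why an IRE during rebuild matters, the following minimal sketch computes the probability of hitting at least one irrecoverable read error while reading a given amount of data. It is not the analytic model of the paper; it only assumes independent bit errors at a fixed rate. The function name, the bit error rate of 1e-15, and the 10 x 1 TB rebuild volume are all hypothetical values chosen for illustration.

```python
import math


def prob_ire_during_read(bytes_read: float, bit_error_rate: float) -> float:
    """Probability of at least one irrecoverable read error while reading
    `bytes_read` bytes, assuming independent bit errors (an assumption of
    this sketch, not a claim about the paper's model)."""
    bits = 8.0 * bytes_read
    # 1 - (1 - p)^bits, evaluated with log1p/expm1 so that very small
    # error rates do not vanish in floating-point rounding.
    return -math.expm1(bits * math.log1p(-bit_error_rate))


# Hypothetical scenario: a rebuild that must read 10 surviving disks of
# 1 TB each, with an assumed bit error rate of 1e-15.
p_rebuild = prob_ire_during_read(10 * 1e12, 1e-15)
print(f"P(at least one IRE during rebuild) = {p_rebuild:.3f}")
```

Under these assumed numbers the probability is several percent per rebuild, which is why the amount of data read during a rebuild and the rate of latent errors left on disk (which scrubbing reduces) both influence the MTTDL.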
