Reliability Assurance of RAID Storage Systems for a Wide Range of Latent Sector Errors

Low-cost disk drives, which are increasingly being adopted in today's data storage systems, offer higher capacity but lower reliability, which leads to more frequent rebuilds and to a higher risk of unrecoverable or latent media errors. An intra-disk redundancy scheme has been proposed to cope with such errors and enhance the reliability of RAID systems. Empirical field results recently reported in the literature, however, suggest that unrecoverable media errors occur more frequently than the data sheet specifications provided by the disk manufacturers indicate. Our results demonstrate that this increase in the number of unrecoverable errors adversely affects the reliability improvement provided by intra-disk redundancy. We also demonstrate that, by revising the parameter choice of the intra-disk redundancy scheme, essentially the same reliability can be obtained as that of a system operating without unrecoverable sector errors. The I/O and throughput performance are evaluated by means of analysis and event-driven simulations. The effects of the spatial locality of errors and of the error-burst length distribution on the system reliability are also investigated.
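To illustrate why a higher unrecoverable-error rate matters during a rebuild, the following is a minimal sketch, not the model used in this work. It assumes that sector errors are independent and occur with the same probability on every sector, which is a simplification, since errors in practice exhibit spatial locality and occur in bursts. The function name, parameter names, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def prob_unrecoverable_during_rebuild(p_sector: float, sectors_read: int) -> float:
    """Probability of hitting at least one unrecoverable sector while
    reading `sectors_read` sectors, ASSUMING independent, identically
    distributed sector errors with per-sector probability `p_sector`.
    """
    if not 0.0 <= p_sector < 1.0:
        raise ValueError("p_sector must lie in [0, 1)")
    if sectors_read < 0:
        raise ValueError("sectors_read must be non-negative")
    # 1 - (1 - p)^n, evaluated via log1p/expm1 so that very small
    # probabilities are not lost to floating-point rounding.
    return -math.expm1(sectors_read * math.log1p(-p_sector))


if __name__ == "__main__":
    # Hypothetical array: 7 surviving disks must be read in full.
    sectors_per_disk = 300 * 10**9 // 512   # illustrative 300 GB disk, 512-byte sectors
    sectors_read = 7 * sectors_per_disk
    # Illustrative per-sector error probabilities (not taken from the paper).
    for p_sector in (1e-12, 1e-11, 1e-10):
        risk = prob_unrecoverable_during_rebuild(p_sector, sectors_read)
        print(f"p_sector={p_sector:.0e}  P(error during rebuild)={risk:.3e}")
```

Under these assumptions the rebuild risk grows roughly in proportion to the per-sector error probability while that probability is small, which is why error rates above the data sheet values translate directly into reduced reliability.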
