Disk Scrubbing Versus Intradisk Redundancy for RAID Storage Systems

We examine two schemes proposed to cope with unrecoverable or latent media errors and to enhance the reliability of RAID systems. The first is the established, widely used disk scrubbing scheme, which periodically accesses disk drives to detect media-related unrecoverable errors. These errors are subsequently corrected by rebuilding the affected sectors. The second is the recently proposed intradisk redundancy scheme, which uses a further level of redundancy inside each disk, in addition to the RAID redundancy across multiple disks. A new model is developed to evaluate the extent to which disk scrubbing reduces unrecoverable sector errors. The probability of encountering unrecoverable sector errors is derived analytically under very general conditions regarding the characteristics of the read/write process of uniformly distributed random workloads, and for a broad spectrum of disk scrubbing schemes that includes the deterministic and random scrubbing schemes. We show that the deterministic scrubbing scheme is the most efficient one. We also derive closed-form expressions for the percentage of unrecoverable sector errors that the scrubbing scheme detects and corrects, the throughput performance, and the minimum scrubbing period achievable under operation with random, uniformly distributed I/O requests. Our results demonstrate that the reliability improvement due to disk scrubbing depends on the scrubbing frequency and the load of the system and, for write-heavy workloads, may not reach the reliability level achieved by a simple interleaved parity-check (IPC)-based intradisk redundancy scheme, which is insensitive to the load. In fact, for small unrecoverable sector error probabilities, the IPC-based intradisk redundancy scheme achieves essentially the same reliability as a system operating without unrecoverable sector errors. For heavy loads, the reliability achieved by the scrubbing scheme can be orders of magnitude lower than that of the intradisk redundancy scheme. Finally, the I/O and throughput performance is evaluated by means of analysis and event-driven simulation.
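
To illustrate the idea behind an interleaved parity-check code inside a disk, the following is a minimal Python sketch, not the construction analyzed in the paper. It assumes a segment of equally sized data sectors protected by `m` parity sectors, where parity sector `j` is the XOR of every data sector whose index is congruent to `j` modulo `m`. The function names (`ipc_parity`, `ipc_recover`), the segment layout, and the use of `None` to mark an unreadable sector are all assumptions made for this illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def ipc_parity(data, m):
    """Return m parity sectors; parity j covers data sectors j, j+m, j+2m, ..."""
    zero = bytes(len(data[0]))
    return [reduce(xor_bytes, data[j::m], zero) for j in range(m)]


def ipc_recover(data, parity, m):
    """Rebuild unreadable sectors (marked None) from the interleaved parities.

    Each interleave tolerates at most one unreadable sector, so any burst of
    up to m consecutive unreadable sectors can be rebuilt.
    """
    out = list(data)
    for j in range(m):
        indices = range(j, len(data), m)
        missing = [i for i in indices if data[i] is None]
        if len(missing) > 1:
            raise ValueError(f"interleave {j} has more than one unreadable sector")
        if missing:
            acc = parity[j]
            for i in indices:
                if data[i] is not None:
                    acc = xor_bytes(acc, data[i])
            out[missing[0]] = acc
    return out


# Hypothetical segment: 8 data sectors of 4 bytes each, 4 interleaves.
data = [bytes([i] * 4) for i in range(8)]
parity = ipc_parity(data, 4)

# A burst of 3 consecutive unreadable sectors touches 3 different interleaves.
damaged = list(data)
damaged[2:5] = [None, None, None]
recovered = ipc_recover(damaged, parity, 4)
```

In this sketch the burst is repaired locally, without involving the other disks of the array. Two unreadable sectors that fall in the same interleave cannot be rebuilt this way and would have to be handled by the RAID redundancy across disks.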
