Reliability modeling of RAID storage systems with latent errors

The reliability of disk storage systems is adversely affected by the presence of latent sector errors. Disk scrubbing and intradisk redundancy are two schemes proposed to cope with unrecoverable or latent media errors and to enhance the reliability of RAID storage systems. Two recent studies have investigated the effectiveness of these schemes but have reached opposing conclusions. These studies were conducted using two different modeling approaches. We present a detailed investigation which reveals that this discrepancy originates from the difference in the approach adopted and in the level of detail incorporated by the two models. We show that, as a consequence, these models provide reliability results that may differ by orders of magnitude, thereby leading to contradictory conclusions. We develop a common analytical framework within which we investigate the details, merits, weaknesses, and applicability of each model. We resolve this discrepancy by deriving enhanced models that incorporate inherent characteristics of the latent-error process and provide realistic reliability results that are in good agreement. We subsequently reassess the reliability results and conclusions presented in previous studies regarding the disk scrubbing and intradisk redundancy schemes.
