Effect of Latent Errors on the Reliability of Data Storage Systems

The reliability of data storage systems is adversely affected by latent sector errors. Because the number of such errors grows with storage capacity, they have become more prevalent in today's high-capacity storage devices. Such errors are typically not detected until an attempt is made to read the affected sectors. When a latent sector error is detected, the redundant data corresponding to the affected sector is used to recover its data. If no such redundant data is available, however, the data of the affected sector is irrecoverably lost from the storage system. The reliability of data storage systems is therefore affected both by the complete failure of storage nodes and by the latent sector errors within them. In this article, closed-form expressions are derived for the mean time to data loss (MTTDL) of erasure-coded storage systems in the presence of latent errors. The effect of latent errors is studied for systems with various types of redundancy, data placement schemes, and sector error probabilities. For small latent sector error probabilities, it is shown that the MTTDL is reduced by a factor that is independent of both the number of parities in the data redundancy scheme and the number of nodes in the system. For large latent sector error probabilities, however, the MTTDL is similar to that of a system using a data redundancy scheme with one fewer parity. The reduction of the MTTDL is more pronounced in the latter case than in the former.
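
To illustrate why latent sector errors matter more as capacity grows, the following is a minimal sketch, not the model derived in the article. It assumes that sectors fail independently with the same probability, so that a device with n sectors holds at least one latent error with probability 1 - (1 - p)^n. The function name, the 512-byte sector size, and the per-sector probability used below are hypothetical choices made only for illustration.

```python
import math


def prob_any_latent_error(capacity_bytes: float, sector_bytes: int, p_sector: float) -> float:
    """Probability that a device holds at least one latent sector error.

    Assumption (not taken from the article): sectors are independent and
    each is in error with the same probability p_sector.
    """
    n_sectors = capacity_bytes / sector_bytes
    # 1 - (1 - p)^n, evaluated with log1p/expm1 so that very small p
    # does not lose precision.
    return -math.expm1(n_sectors * math.log1p(-p_sector))


if __name__ == "__main__":
    # Hypothetical values: 512-byte sectors, per-sector probability 1e-12.
    for terabytes in (1, 4, 16):
        p = prob_any_latent_error(terabytes * 1e12, 512, 1e-12)
        print(f"{terabytes:>2} TB: {p:.6f}")
```

Under these assumptions the probability rises monotonically with capacity, which matches the qualitative point made in the abstract; the article's actual MTTDL expressions additionally account for node failures, the redundancy scheme, and data placement.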
