Expected Annual Fraction of Data Loss as a Metric for Data Storage Reliability

Several redundancy and recovery schemes have been developed to enhance the reliability of storage systems. The effectiveness of these schemes has predominantly been evaluated based on the mean time to data loss (MTTDL) metric, which has proven useful for assessing tradeoffs, for comparing schemes, and for estimating the effect of the various parameters on system reliability. In the context of distributed and cloud storage systems, for economic reasons, it is of great importance to consider the magnitude of data loss along with its frequency. We focus on the following reliability metric: the expected annual fraction of data loss (EAFDL), that is, the fraction of stored data that is expected to be lost by the system annually. We present a general methodology to obtain the EAFDL metric analytically, in conjunction with the MTTDL metric, for various redundancy schemes and for a large class of failure time distributions that also includes real-world distributions such as Weibull and gamma. As a demonstration, we subsequently apply this methodology to derive these metrics analytically and to assess the reliability of a replication-based storage system under clustered, declustered, and symmetric data placement schemes. We show that the declustered placement scheme offers superior reliability in terms of both metrics. Previous work has used simulation to evaluate the magnitude of data loss, but this is the first work to assess it analytically, and the first to present a general theoretical framework for this context.
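
To make the distinction between the two metrics concrete, the following is a minimal sketch and not the paper's derivation. It assumes, as a simplification not stated in the abstract, that data-loss events occur at a long-run rate of one per MTTDL and that each event loses a fixed expected amount of data, so that EAFDL is roughly the expected loss per event divided by the product of MTTDL (in years) and the amount of stored data. The function name `eafdl_estimate` and all numbers are hypothetical.

```python
def eafdl_estimate(mttdl_years, expected_loss_per_event, stored_data):
    """Rough EAFDL estimate (fraction of stored data lost per year).

    Assumption (not from the abstract): loss events occur at an average
    rate of 1 / mttdl_years per year, and each event loses
    expected_loss_per_event units of data on average.
    expected_loss_per_event and stored_data must use the same unit.
    """
    if mttdl_years <= 0 or stored_data <= 0:
        raise ValueError("mttdl_years and stored_data must be positive")
    if expected_loss_per_event < 0:
        raise ValueError("expected_loss_per_event must be non-negative")
    return expected_loss_per_event / (mttdl_years * stored_data)


# Two hypothetical systems, each storing 1000 TB.
# System A: rare loss events (MTTDL = 1000 years), 1 TB lost per event.
system_a = eafdl_estimate(1000.0, 1.0, 1000.0)
# System B: more frequent events (MTTDL = 100 years), 0.001 TB lost per event.
system_b = eafdl_estimate(100.0, 0.001, 1000.0)

print(f"System A EAFDL estimate: {system_a:.2e}")
print(f"System B EAFDL estimate: {system_b:.2e}")
```

In this made-up comparison, system A has the better MTTDL but the worse estimated EAFDL, which is why the magnitude of loss is worth considering alongside its frequency.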
