Global Reliability Evaluation for Cloud Storage Systems with Proactive Fault Tolerance

In addition to traditional reactive fault-tolerant technologies, such as erasure codes and replication, proactive fault tolerance can significantly improve a system’s reliability. To the best of our knowledge, however, there are no previous publications on the reliability of cloud storage systems with proactive fault tolerance, apart from studies of RAID systems. In this paper, we propose several Markov-based models to evaluate, from the system perspective, the reliability of cloud storage systems with and without proactive fault tolerance. Because a proactive measure must be coupled with a reactive measure to ensure the system’s reliability, the reliability model for such a system is intricate. To facilitate model building, we propose the basic state transition unit (BSTU), which describes the general pattern of state transitions in proactive cloud storage systems. The BSTU serves as the generic “brick” for building the overall reliability model of such a system. Using our models, we demonstrate the benefits of proactive fault tolerance for a system’s reliability and estimate the impact of several system parameters on it.
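As a rough illustration of what a Markov-based reliability model computes, the following minimal sketch evaluates the mean time to data loss (MTTDL) of a single mirrored pair of drives as an absorbing continuous-time Markov chain. It is not the BSTU model proposed in this paper. The function names, the rates `lam` (per-drive failure rate) and `mu` (rebuild rate), and the parameter `p` (fraction of failures predicted early enough that the drive is replaced without losing redundancy) are all assumptions made for the example.

```python
def mean_time_to_absorption(rates, n_transient):
    """Expected time to reach an absorbing state, starting from state 0.

    rates: dict mapping (i, j) -> transition rate from state i to state j.
    States 0..n_transient-1 are transient; any other index is absorbing.
    Solves (-Q_T) t = 1, where Q_T is the generator restricted to transient states.
    """
    n = n_transient
    a = [[0.0] * n for _ in range(n)]
    for (i, j), r in rates.items():
        a[i][i] += r              # total outflow of state i on the diagonal
        if j < n:
            a[i][j] -= r          # flow into another transient state
    b = [1.0] * n

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]

    # Back substitution.
    t = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(a[row][k] * t[k] for k in range(row + 1, n))
        t[row] = (b[row] - s) / a[row][row]
    return t[0]


def mirrored_pair_mttdl(lam, mu, p=0.0):
    """MTTDL of a two-replica group (hypothetical toy model).

    State 0: both replicas healthy; state 1: one replica lost, rebuilding;
    state 2: data loss (absorbing).
    Assumption: a predicted failure (probability p) is always handled in time,
    so only unpredicted failures, at rate (1 - p) * lam, reduce redundancy.
    """
    lam_eff = (1.0 - p) * lam
    rates = {
        (0, 1): 2.0 * lam_eff,    # either replica fails without warning
        (1, 0): mu,               # rebuild completes
        (1, 2): lam_eff,          # remaining replica fails during rebuild
    }
    return mean_time_to_absorption(rates, n_transient=2)


if __name__ == "__main__":
    lam, mu = 1e-5, 0.1           # illustrative per-hour rates, not measured values
    for p in (0.0, 0.5, 0.9):
        print(f"p={p:.1f}  MTTDL={mirrored_pair_mttdl(lam, mu, p):.3e} hours")
```

With `p = 0` the result matches the standard closed form for this three-state chain, `(3*lam + mu) / (2*lam**2)`. Because the rebuild rate is much larger than the failure rate, raising `p` increases the MTTDL roughly in proportion to `1 / (1 - p)**2` under the stated assumption. A model that also covers imperfect predictions, limited warning time, or many storage nodes would need additional states.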
