Characterizing Disk Failures with Quantified Disk Degradation Signatures: An Early Experience

With the advent of cloud computing and online services, large enterprises rely heavily on their data centers to serve end users. Among server components, hard disk drives are known to contribute significantly to server failures. Disk failures, and their impact on storage system performance and operating costs, are an increasingly important concern for data center designers and operators. However, the characteristics of disk failures in data centers remain poorly understood, and effective disk failure management and data recovery require a deep understanding of the nature of these failures. In this paper, we present a systematic approach that provides a holistic view of disk failures. We study a large-scale storage system in a production data center. We categorize disk failures based on their distinctive manifestations and properties. We then characterize how disk errors degrade into failures by deriving a degradation signature for each failure category. We also quantify the influence of disk health attributes on failure degradation. We discuss how the derived degradation signatures can be leveraged to forecast disk failures, even in their early stages. To the best of our knowledge, this is the first work that shows how to discover the categories of disk failures and characterize their degradation processes in a production data center.
