Being Accurate Is Not Enough: New Metrics for Disk Failure Prediction

Traditionally, disk failure prediction models are evaluated by their prediction accuracy. However, accuracy may not reflect how these models are actually used in cloud storage systems, where the goal is to protect data against failures rather than only to predict them. In this paper, we propose two new metrics for disk failure prediction models: migration rate, which measures how much at-risk data is protected as a result of correct failure predictions, and mismigration rate, which measures how much data is migrated needlessly as a result of false failure predictions. To demonstrate their effectiveness, we compare disk failure prediction methods: (a) a classification tree (CT) model vs. a state-of-the-art recurrent neural network (RNN) model, and (b) a proposed residual life prediction model based on gradient boosted regression trees (GBRTs) vs. the RNN model. While prediction accuracy experiments favor the RNN model, migration rate experiments can favor the CT and GBRT models, depending on transfer rates. We conclude that prediction accuracy can be a misleading metric. Moreover, the proposed GBRT model offers a practical improvement in disk failure prediction in real-world data centers.
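As an informal illustration of the two metrics, the Python sketch below computes them for a small set of disks. The formulas are assumptions made for illustration, since the abstract does not give exact definitions: migration rate is taken as the fraction of data on failed disks that can be copied off between the alarm and the failure at a given transfer rate, and mismigration rate as the fraction of data on healthy disks that is moved because of false alarms. The `Disk` record, its fields, and the function name `migration_rates` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Disk:
    data_gb: float                 # amount of data stored on the disk
    failed: bool                   # whether the disk actually failed
    alarm_lead_h: Optional[float]  # hours between alarm and failure; None if no alarm


def migration_rates(disks: Sequence[Disk], transfer_gb_per_h: float) -> Tuple[float, float]:
    """Return (migration_rate, mismigration_rate) under the assumed definitions."""
    at_risk_gb = sum(d.data_gb for d in disks if d.failed)
    healthy_gb = sum(d.data_gb for d in disks if not d.failed)

    protected_gb = 0.0  # data moved off disks that really failed
    wasted_gb = 0.0     # data moved off disks that never failed

    for d in disks:
        if d.alarm_lead_h is None:
            continue  # no alarm was raised, so nothing is migrated
        if d.failed:
            # Only what fits into the lead time can be saved (assumption).
            protected_gb += min(d.data_gb, d.alarm_lead_h * transfer_gb_per_h)
        else:
            # Assumption: a false alarm eventually moves the whole disk.
            wasted_gb += d.data_gb

    migration = protected_gb / at_risk_gb if at_risk_gb else 0.0
    mismigration = wasted_gb / healthy_gb if healthy_gb else 0.0
    return migration, mismigration


disks = [
    Disk(data_gb=1000, failed=True, alarm_lead_h=10.0),   # correct prediction
    Disk(data_gb=1000, failed=True, alarm_lead_h=None),   # missed failure
    Disk(data_gb=500, failed=False, alarm_lead_h=5.0),    # false alarm
    Disk(data_gb=1500, failed=False, alarm_lead_h=None),  # correctly left alone
]

slow = migration_rates(disks, transfer_gb_per_h=50.0)
fast = migration_rates(disks, transfer_gb_per_h=200.0)
print(slow, fast)
```

Under these assumptions, the same set of predictions protects a different share of the at-risk data at different transfer rates, which is one way a model with longer warning times could score better on migration rate than a more accurate model that raises its alarms late.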
