New Metrics for Disk Failure Prediction That Go Beyond Prediction Accuracy

Prediction accuracy (true positives, false positives, and so on) is the usual way to evaluate disk-failure prediction models. In practice, however, we aim not only to predict failures correctly but also to protect data against them, i.e., we need to take appropriate action after a failure prediction. In storage systems, protecting data requires migrating at-risk data, which consumes network and disk bandwidth and is particularly problematic for large-scale and cloud systems. This paper consolidates and builds on Li et al. (2016), where we proposed two new metrics, migration rate (MR) and mismigration rate (MMR), to measure the quality of disk failure prediction: MR measures how much at-risk data is migrated (and therefore protected) as a result of correct failure predictions, while MMR measures how much data is migrated needlessly as a result of incorrect failure predictions. In this paper, we additionally propose measuring quality in terms of migration time and mismigration time, which measure, respectively, the time spent migrating at-risk disks and the time spent mismigrating healthy disks because of false alarms. To demonstrate these metrics’ usefulness, we use them to compare disk-failure prediction methods: 1) a classification tree (CT) model against a state-of-the-art recurrent neural network (RNN) model, and 2) a gradient-boosted regression tree (GBRT) model (which predicts residual life) against RNN. We observe that while RNN performs best in the prediction accuracy experiments, the CT and GBRT models sometimes outperform RNN in the resource-dependent migration-rate experiments. We conclude that prediction accuracy is sometimes misleading: correct predictions do not necessarily imply protected data. We additionally present an improved GBRT model (GBRT+), which offers a practical improvement in disk residual-life prediction according to the newly proposed metrics.
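To make the distinction between accuracy and protection concrete, the following is a minimal Python sketch of how MR and MMR might be computed. The abstract does not give the formulas, so everything here is an assumption made for illustration: the `Disk` record, the `migration_metrics` function, the fixed per-disk migration bandwidth, and the choice to normalize MR by the total at-risk data and MMR by the total data on healthy disks.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    data_gb: float        # amount of data stored on the disk
    will_fail: bool       # ground truth: the disk actually fails
    predicted_fail: bool  # the model raised a failure alarm
    lead_time_h: float    # hours between alarm and failure (unused for healthy disks)


def migration_metrics(disks, bandwidth_gb_per_h):
    """Return (MR, MMR) under the assumed definitions described above."""
    at_risk = sum(d.data_gb for d in disks if d.will_fail)
    healthy = sum(d.data_gb for d in disks if not d.will_fail)
    migrated = 0.0
    mismigrated = 0.0
    for d in disks:
        if not d.predicted_fail:
            continue  # no alarm, so nothing is migrated
        if d.will_fail:
            # Correct alarm: only the data moved before the failure is protected.
            movable = bandwidth_gb_per_h * max(d.lead_time_h, 0.0)
            migrated += min(d.data_gb, movable)
        else:
            # False alarm: assume the whole healthy disk is migrated needlessly.
            mismigrated += d.data_gb
    mr = migrated / at_risk if at_risk else 0.0
    mmr = mismigrated / healthy if healthy else 0.0
    return mr, mmr


# Hypothetical fleet: one correct alarm with short lead time, one missed
# failure, one false alarm, and one correctly ignored healthy disk.
fleet = [
    Disk(100.0, True, True, 5.0),
    Disk(100.0, True, False, 0.0),
    Disk(200.0, False, True, 0.0),
    Disk(200.0, False, False, 0.0),
]
mr, mmr = migration_metrics(fleet, bandwidth_gb_per_h=10.0)
print(mr, mmr)  # prints "0.25 0.5"
```

In this hypothetical fleet the first disk counts as a true positive, yet only half of its data can be moved in the available lead time, which is the sense in which a correct prediction need not imply protected data.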

[1]  J. Friedman.  Greedy function approximation: A gradient boosting machine , 2001 .

[2]  Joseph F. Murray,et al.  Hard drive failure prediction using non-parametric statistical methods , 2003 .

[3]  Tie-Yan Liu,et al.  Health Status Assessment and Failure Prediction for Hard Drives with Recurrent Neural Networks , 2016, IEEE Transactions on Computers.

[4]  Gang Wang,et al.  Hard Drive Failure Prediction Using Classification and Regression Trees , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[5]  Weimin Zheng,et al.  Predicting Disk Failures with HMM- and HSMM-Based Approaches , 2010, ICDM.

[6]  Javam C. Machado,et al.  A Fault Detection Method for Hard Disk Drives Based on Mixture of Gaussians and Nonparametric Statistics , 2017, IEEE Transactions on Industrial Informatics.

[7]  Fred Douglis,et al.  RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures , 2015, FAST.

[8]  Gang Wang,et al.  A combined Bayesian network method for predicting drive failure times from SMART attributes , 2016, 2016 International Joint Conference on Neural Networks (IJCNN).

[9]  Gang Wang,et al.  ProCode: A Proactive Erasure Coding Scheme for Cloud Storage Systems , 2016, 2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS).

[10]  Ethan L. Miller,et al.  Evaluation of distributed recovery in large-scale storage systems , 2004, Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004.

[11]  Greg Hamerly,et al.  Bayesian approaches to failure prediction for disk drives , 2001, ICML.

[12]  Hai Jin,et al.  Disk Failure Prediction in Data Centers via Online Learning , 2018, ICPP.

[13]  Gang Wang,et al.  Hard drive failure prediction using Decision Trees , 2017, Reliab. Eng. Syst. Saf..

[14]  Hong Jiang,et al.  Proactive Data Migration for Improved Storage Availability in Large-Scale Data Centers , 2015, IEEE Transactions on Computers.

[15]  Bianca Schroeder,et al.  Proactive error prediction to improve storage system reliability , 2017, USENIX Annual Technical Conference.

[16]  Gang Wang,et al.  Proactive drive failure prediction for large scale storage systems , 2013, 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST).

[17]  Peng Li,et al.  Improving Service Availability of Cloud Systems by Predicting Disk Error , 2018, USENIX Annual Technical Conference.

[18]  Qiang Miao,et al.  Online Anomaly Detection for Hard Disk Drives Based on Mahalanobis Distance , 2013, IEEE Transactions on Reliability.

[19]  Kilian Q. Weinberger,et al.  Web-Search Ranking with Initialized Gradient Boosted Regression Trees , 2010, Yahoo! Learning to Rank Challenge.

[20]  T. Yoneyama,et al.  Prognostics performance metrics and their relation to requirements, design, verification and cost-benefit , 2008, 2008 International Conference on Prognostics and Health Management.

[21]  Sankalita Saha,et al.  Evaluating algorithm performance metrics tailored for prognostics , 2009, 2009 IEEE Aerospace conference.

[22]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? , 2007, FAST.

[23]  Gang Wang,et al.  Being Accurate Is Not Enough: New Metrics for Disk Failure Prediction , 2016, 2016 IEEE 35th Symposium on Reliable Distributed Systems (SRDS).

[24]  K. Goebel,et al.  Metrics for evaluating performance of prognostic techniques , 2008, 2008 International Conference on Prognostics and Health Management.

[25]  Joseph F. Murray,et al.  Improved disk-drive failure warnings , 2002, IEEE Trans. Reliab..

[26]  Sankalita Saha,et al.  Prognostic Performance Metrics , 2011 .

[27]  Jun Liu,et al.  Fatman: Cost-saving and reliable archival storage based on volunteer resources , 2014, Proc. VLDB Endow..

[28]  Joseph F. Murray,et al.  Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application , 2005, J. Mach. Learn. Res..

[29]  Takashi Yoneyama,et al.  How to tell the good from the bad in failure prognostics methods , 2010, 2010 IEEE Aerospace Conference.

[30]  Qiang Miao,et al.  Health monitoring of hard disk drive based on Mahalanobis distance , 2011, 2011 Prognostics and System Health Management Conference.

[31]  Albert Mo Kim Cheng,et al.  Disk failure prediction in heterogeneous environments , 2017, 2017 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS).

[32]  Gang Wang,et al.  A Proactive Fault Tolerance Scheme for Large Scale Storage Systems , 2015, ICA3PP.