Hard drive failure prediction using Decision Trees

This paper proposes two hard drive failure prediction models, based on Decision Trees (DTs) and Gradient Boosted Regression Trees (GBRTs), that combine strong prediction performance with stability and interpretability. The models are evaluated on a real-world dataset containing 121,698 drives in total. Experimental results show that the DT model predicts over 93% of failures at a false alarm rate under 0.01%, and that the GBRT model achieves a failure detection rate of about 90% without any false alarms. Moreover, the GBRT model estimates drive health (or fault probability), which provides a quantitative indicator of failure urgency. This enables operators to allocate system resources for pre-warning migrations according to that urgency while maintaining the quality of user services.
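The two metrics quoted above can be made concrete with a small sketch. It assumes the definitions commonly used in the drive failure prediction literature, which the abstract itself does not spell out: the failure detection rate is the fraction of failed drives flagged as failing, and the false alarm rate is the fraction of good drives flagged as failing. The function name and the toy data are hypothetical and are not taken from the paper.

```python
def detection_and_false_alarm_rates(actual_failed, predicted_failed):
    """Return (failure detection rate, false alarm rate) over a set of drives.

    Both arguments are per-drive booleans in the same order.
    Assumed definitions (not stated in the abstract):
      detection rate   = detected failed drives / all failed drives
      false alarm rate = good drives flagged as failing / all good drives
    """
    actual = list(actual_failed)
    predicted = list(predicted_failed)
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")

    failed_total = sum(1 for a in actual if a)
    good_total = len(actual) - failed_total

    detected = sum(1 for a, p in zip(actual, predicted) if a and p)
    false_alarms = sum(1 for a, p in zip(actual, predicted) if not a and p)

    # Guard against empty classes so the sketch never divides by zero.
    fdr = detected / failed_total if failed_total else 0.0
    far = false_alarms / good_total if good_total else 0.0
    return fdr, far


# Hypothetical toy population: 10 failed drives and 90 good drives.
actual = [True] * 10 + [False] * 90
# The toy predictor catches 9 of the 10 failures and flags 1 good drive.
predicted = [True] * 9 + [False] + [True] + [False] * 89

fdr, far = detection_and_false_alarm_rates(actual, predicted)
print(f"failure detection rate: {fdr:.2%}, false alarm rate: {far:.2%}")
```

Because good drives vastly outnumber failed ones in a real population, even a very small false alarm rate can correspond to many unnecessary migrations, which is why the two rates are reported together.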
