Hard Drive Failure Prediction Using Classification and Regression Trees

Some statistical and machine learning methods have been proposed to build hard drive prediction models based on the SMART attributes, and have achieved good prediction performance. However, these models were not evaluated in the way as they are used in real-world data centers. Moreover, the hard drives deteriorate gradually, but these models can not describe this gradual change precisely. This paper proposes new hard drive failure prediction models based on Classification and Regression Trees, which perform better in prediction performance as well as stability and interpretability compared with the state-of the-art model, the Back propagation artificial neural network model. Experiments demonstrate that the Classification Tree (CT) model predicts over 95% of failures at a false alarm rate (FAR) under 0.1% on a real-world dataset containing 25,792 drives. Aiming at the practical application of prediction models, we test them with different drive families, with fewer number of drives, and with different model updating strategies. The CT model still shows steady and good performance. We propose a health degree model based on Regression Tree (RT) as well, which can give the drive a health assessment rather than a simple classification result. Therefore, the approach can deal with warnings raised by the prediction model in order of their health degrees. We implement a reliability model for RAID-6 systems with proactive fault tolerance and show that our CT model can significantly improve the reliability and/or reduce construction and maintenance cost of large-scale storage systems.

[1]  Qiang Miao,et al.  Online Anomaly Detection for Hard Disk Drives Based on Mahalanobis Distance , 2013, IEEE Transactions on Reliability.

[2]  Joseph F. Murray,et al.  Hard drive failure prediction using non-parametric statistical methods , 2003 .

[3]  Joseph F. Murray,et al.  Improved disk-drive failure warnings , 2002, IEEE Trans. Reliab..

[4]  Andrea C. Arpaci-Dusseau,et al.  An analysis of data corruption in the storage stack , 2008, TOS.

[5]  Xubin He,et al.  Failure Prediction Models for Proactive Fault Tolerance within Storage Systems , 2008, 2008 IEEE International Symposium on Modeling, Analysis and Simulation of Computers and Telecommunication Systems.

[6]  David A. Patterson,et al.  Designing Disk Arrays for High Data Reliability , 1993, J. Parallel Distributed Comput..

[7]  Eduardo Pinheiro,et al.  Failure Trends in a Large Disk Drive Population , 2007, FAST.

[8]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[9]  Greg Hamerly,et al.  Bayesian approaches to failure prediction for disk drives , 2001, ICML.

[10]  Gang Wang,et al.  Proactive drive failure prediction for large scale storage systems , 2013, 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST).

[11]  Shankar Pasupathy,et al.  An analysis of latent sector errors in disk drives , 2007, SIGMETRICS '07.

[12]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.

[13]  Joseph F. Murray,et al.  Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application , 2005, J. Mach. Learn. Res..

[14]  Graham J. Williams Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery , 2011 .

[15]  Bruce Allen,et al.  Monitoring hard disks with smart , 2004 .

[16]  Weimin Zheng,et al.  Predicting Disk Failures with HMM- and HSMM-Based Approaches , 2010, ICDM.

[17]  Qiang Miao,et al.  Health monitoring of hard disk drive based on Mahalanobis distance , 2011, 2011 Prognostics and System Health Managment Confernece.

[18]  Kashi Venkatesh Vishwanath,et al.  Characterizing cloud computing hardware reliability , 2010, SoCC '10.