Disk Failure Prediction in Data Centers via Online Learning

Disk failure has become a major concern with the rapid expansion of storage systems in data centers. Many researchers derive disk failure prediction models from SMART (Self-Monitoring, Analysis and Reporting Technology) attributes using machine learning techniques. Despite significant progress, most of this work relies on offline training, so the resulting models cannot adapt to continuously arriving data and suffer from the 'model aging' problem. We are therefore motivated to uncover the root cause of 'model aging', namely the dynamic distribution of SMART attributes, and to resolve the resulting performance degradation in practice. In this paper, we introduce a novel disk failure prediction model using Online Random Forests (ORFs). Our ORF-based model evolves automatically, on the fly, as data arrive sequentially, and is thus highly adaptive to changes in the SMART distribution over time. Moreover, it achieves better prediction performance than its offline counterparts. Experiments on real-world datasets show that our ORF model converges rapidly to the performance of offline random forests and achieves stable failure detection rates of 93-99% with low false alarm rates. Furthermore, we demonstrate that our approach maintains stable prediction performance over long-term use in data centers.
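
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how a failure detection rate and a false alarm rate are commonly computed in this literature. It assumes per-disk boolean labels, with the detection rate taken as the fraction of failed disks that were flagged and the false alarm rate as the fraction of healthy disks that were flagged. The function name and the sample labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def detection_and_false_alarm_rates(
    actual_failed: Sequence[bool],
    predicted_failed: Sequence[bool],
) -> Tuple[float, float]:
    """Return (failure detection rate, false alarm rate) over per-disk labels.

    Assumed definitions (not confirmed by the source):
      detection rate   = flagged failed disks  / all failed disks
      false alarm rate = flagged healthy disks / all healthy disks
    """
    if len(actual_failed) != len(predicted_failed):
        raise ValueError("label sequences must have the same length")

    failed = sum(1 for a in actual_failed if a)
    healthy = len(actual_failed) - failed
    if failed == 0 or healthy == 0:
        raise ValueError("need at least one failed and one healthy disk")

    pairs = list(zip(actual_failed, predicted_failed))
    detected = sum(1 for a, p in pairs if a and p)
    false_alarms = sum(1 for a, p in pairs if not a and p)
    return detected / failed, false_alarms / healthy


# Hypothetical example: 2 failed disks and 4 healthy disks.
actual = [True, True, False, False, False, False]
predicted = [True, False, True, False, False, False]
fdr, far = detection_and_false_alarm_rates(actual, predicted)
```

In this made-up example one of the two failed disks is flagged and one of the four healthy disks is flagged, so `fdr` is 0.5 and `far` is 0.25.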