Failure Order: A Missing Piece in Disk Failure Processing of Data Centers

To avoid data loss, data centers adopt disk failure prediction (DFP) technology to raise warnings ahead of actual disk failures, and process the warnings in the order they are raised, i.e., a first-in-first-out (FIFO) warning order. The FIFO-guided warning order can process warnings in a timely manner when disk failures are rare. As data centers grow, the increasing number of disk failures leads to a complex situation in which multiple warnings are raised simultaneously. In this situation, the FIFO-guided warning order neither processes warnings in a timely manner nor manages them properly, because it assigns no priority to warnings. Thus, real-time and finer-grained priority guidance for warning order management is urgently needed. To this end, we turn our attention to the failures themselves, since each warning corresponds to a failure event. The key insight is that the interdependence of failures, i.e., the order in which failures occur, indicates the order in which warnings should be processed. With an accurate failure order, data centers can decrease the probability of data loss and the downtime of latency-sensitive applications by processing urgent warnings first. In this paper, we predict the failure order with a LambdaMART model, a state-of-the-art ranking algorithm in information retrieval. Because information-retrieval metrics place heavy emphasis on the correctness of high-ranked items, we design a symmetric metric to evaluate the quality of the predicted failure order. Experiments on a public dataset provided by Backblaze show that our model outperforms the FIFO order and the order derived from previous DFP models.
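As an illustration of what a symmetric evaluation of failure order could look like, the following minimal sketch scores a candidate warning order by the fraction of disk pairs whose relative order matches the actual failure order, so that every pair counts equally regardless of rank position. This is an assumption made for illustration only: the abstract does not define the paper's metric, and the function name `pairwise_order_agreement` and the disk identifiers are hypothetical.

```python
from itertools import combinations


def pairwise_order_agreement(candidate, actual):
    """Fraction of disk pairs ordered the same way in both rankings.

    Illustrative stand-in for a symmetric ranking metric (NOT the paper's
    metric): every pair has equal weight, unlike top-heavy IR metrics.
    Both arguments are lists containing the same disk identifiers.
    """
    pos_candidate = {disk: i for i, disk in enumerate(candidate)}
    pos_actual = {disk: i for i, disk in enumerate(actual)}
    pairs = list(combinations(actual, 2))
    if not pairs:
        return 1.0  # zero or one disk: nothing can be misordered
    agreeing = sum(
        1
        for a, b in pairs
        if (pos_candidate[a] < pos_candidate[b]) == (pos_actual[a] < pos_actual[b])
    )
    return agreeing / len(pairs)


# Hypothetical example: four disks with warnings raised at about the same time.
actual_failure_order = ["disk_c", "disk_a", "disk_d", "disk_b"]
fifo_warning_order = ["disk_a", "disk_b", "disk_c", "disk_d"]
predicted_order = ["disk_c", "disk_d", "disk_a", "disk_b"]

fifo_score = pairwise_order_agreement(fifo_warning_order, actual_failure_order)
predicted_score = pairwise_order_agreement(predicted_order, actual_failure_order)
print(f"FIFO: {fifo_score:.3f}, predicted: {predicted_score:.3f}")
```

In this made-up example the predicted order agrees with the actual failure order on more pairs than the FIFO order does; the numbers say nothing about the results reported in the paper.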
