Cost‐efficiency disk failure prediction via threshold‐moving

Self‐Monitoring, Analysis, and Reporting Technology (SMART) is built into hard disk drives to predict impending disk failures so that data can be repaired in advance. Because the prediction accuracy of SMART alone is unsatisfactory, machine learning techniques have recently been explored to improve it. These approaches treat disk failure prediction as a binary classification problem with SMART attributes as features, and some of them achieve satisfactory prediction accuracy. However, the primary objective of these methods is to reduce disk failure recovery costs, and there is no uniform metric for measuring their financial impact. In this article, we take a financial impact perspective and propose a simple yet practical metric, Mean‐Cost‐To‐Recovery (MCTR), for disk failure prediction in data centers. Specifically, MCTR measures the total misprediction cost by assigning different weights to mispredicted healthy disks and mispredicted failed disks. In addition, we argue that the commonly used threshold of 0.5 for disk failure prediction is suboptimal because the data are imbalanced, that is, failed disks are far fewer than healthy ones. To find the optimal threshold, the one that yields the minimal MCTR, we wrap a cost‐minimizing procedure around disk failure prediction and search for the threshold with a threshold‐moving technique. Moreover, to map sample‐level prediction results to disk‐level prediction results, we design a modified leaky‐bucket algorithm that determines the health state of a disk from its multiple sample‐level prediction results. To evaluate the effectiveness of our approach, we conduct extensive experiments on three real‐world datasets. The experimental results show that our approach reduces MCTR by up to 86.9% compared with reactive data protection schemes, and by up to 22.3% compared with cost‐blind failure prediction.
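The two mechanisms named in the abstract, threshold‐moving and leaky‐bucket aggregation, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the exact MCTR formula nor the details of the modified leaky bucket. The sketch assumes a stand‐in cost (the average of a per‐false‐alarm cost `cost_fp` and a per‐missed‐failure cost `cost_fn` over all disks), a plain grid search over thresholds, and a generic leaky bucket. All function names, parameters, and default values are hypothetical.

```python
from typing import Sequence, Tuple


def mean_cost(scores: Sequence[float], labels: Sequence[int],
              threshold: float, cost_fp: float, cost_fn: float) -> float:
    """Average misprediction cost per disk at a given threshold.

    Assumed stand-in for MCTR: a false alarm on a healthy disk costs
    cost_fp, a missed failed disk costs cost_fn, correct predictions cost 0.
    """
    total = 0.0
    for score, failed in zip(scores, labels):
        predicted_failed = score >= threshold
        if predicted_failed and not failed:
            total += cost_fp   # healthy disk flagged as failing
        elif failed and not predicted_failed:
            total += cost_fn   # failing disk missed
    return total / len(labels)


def best_threshold(scores: Sequence[float], labels: Sequence[int],
                   cost_fp: float, cost_fn: float,
                   steps: int = 100) -> Tuple[float, float]:
    """Threshold-moving as a grid search: try thresholds 0, 1/steps, ..., 1
    and return (threshold, cost) with the lowest average cost."""
    candidates = [i / steps for i in range(steps + 1)]
    cost, threshold = min(
        (mean_cost(scores, labels, t, cost_fp, cost_fn), t) for t in candidates
    )
    return threshold, cost


def disk_flagged(sample_predictions: Sequence[int],
                 fill: int = 2, leak: int = 1, capacity: int = 4) -> bool:
    """Generic leaky bucket (assumed parameters, not the paper's variant).

    Each sample predicted as failing adds `fill` to the bucket, each sample
    predicted as healthy drains `leak`; the disk is flagged as failing once
    the level reaches `capacity`.
    """
    level = 0
    for predicted_failed in sample_predictions:
        if predicted_failed:
            level += fill
        else:
            level = max(0, level - leak)
        if level >= capacity:
            return True
    return False


if __name__ == "__main__":
    # Toy imbalanced data: six healthy disks (label 0), two failed (label 1).
    scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.60, 0.35, 0.45]
    labels = [0, 0, 0, 0, 0, 0, 1, 1]
    print("cost at 0.5:", mean_cost(scores, labels, 0.5, cost_fp=1, cost_fn=10))
    print("best (threshold, cost):",
          best_threshold(scores, labels, cost_fp=1, cost_fn=10))
    print("isolated alarms flag disk:", disk_flagged([1, 0, 0, 1, 0, 0]))
    print("consecutive alarms flag disk:", disk_flagged([1, 1]))
```

In this toy data both failed disks score below 0.5, so the default threshold misses them and pays the higher miss cost twice, whereas the search settles on a lower threshold that accepts one false alarm instead. The bucket ignores isolated positive samples and flags a disk only when positives accumulate faster than they drain.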
