P3: Priority-based proactive prediction for soon-to-fail disks

Predicting soon-to-fail (STF) disks is fundamental to keeping disk data safe and enforcing quality of service. Most current proactive prediction approaches achieve a high prediction rate at the cost of a high false alarm rate, that is, they label healthy disks as STF. This happens because failed disks make up only a small, imbalanced fraction of the training dataset and because of the characteristics of the machine learning (ML) techniques used. Since healthy disks far outnumber STF disks, a high false alarm rate means that more healthy disks than actual STF disks may be labeled as STF, which wastes resources such as network bandwidth and new disks. The cumulative number of false alarms can be even larger because the prediction is performed periodically. This paper presents a priority-based proactive prediction algorithm for STF disks (P3), which leverages a combination of ML models. The predictor takes the Self-Monitoring, Analysis and Reporting Technology (SMART) attributes of all disks as input and outputs the disks predicted to be STF. Compared to existing approaches, P3 achieves a lower false alarm rate, which is important for efficiently scheduling resources for disk data and service migration. The tradeoff is a slight decrease in prediction rate, which is negligible because the prediction is made periodically and reactive prediction can be employed as a backup. In an evaluation with Weka on a population of 7,018 disks, the predictor predicts 112 of the 130 failed disks with 36 false alarms. Compared with state-of-the-art ML models that predict 122 failed disks with 34 false alarms, P3 predicts 113 failed disks with 7 false alarms.
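To make the tradeoff concrete, the counts reported above can be converted into rates. The following is a minimal sketch, not part of the paper: the function name `rates` is hypothetical, and it assumes that the population of 7,018 disks includes the 130 failed disks, so that 6,888 disks are healthy. Prediction rate is taken as detected failures over all failed disks, and false alarm rate as false alarms over all healthy disks.

```python
def rates(detected, false_alarms, failed, population):
    """Return (prediction_rate, false_alarm_rate, precision) from raw counts.

    Assumption: `population` counts healthy and failed disks together.
    """
    healthy = population - failed
    prediction_rate = detected / failed          # fraction of failed disks caught
    false_alarm_rate = false_alarms / healthy    # fraction of healthy disks flagged
    precision = detected / (detected + false_alarms)  # flagged disks that really fail
    return prediction_rate, false_alarm_rate, precision


POPULATION = 7018  # disks in the evaluation
FAILED = 130       # failed disks in the evaluation

# Counts quoted in the abstract for the compared ML models and for P3.
baseline = rates(122, 34, FAILED, POPULATION)
p3 = rates(113, 7, FAILED, POPULATION)

for name, (pr, far, prec) in (("baseline", baseline), ("P3", p3)):
    print(f"{name}: prediction rate {pr:.1%}, "
          f"false alarm rate {far:.2%}, precision {prec:.1%}")
```

Under these assumptions, the false alarm rate drops by roughly a factor of five while the prediction rate falls by about seven percentage points, which is the exchange the abstract describes.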
