Transfer Learning based Failure Prediction for Minority Disks in Large Data Centers of Heterogeneous Disk Systems

Storage systems in large-scale data centers are typically built on thousands or even millions of disks, so disk failures happen constantly. A disk failure can cause serious data loss, and thus system unavailability or even catastrophic consequences if the lost data cannot be recovered. While replication and erasure coding have been widely deployed to guarantee storage availability and reliability, disk failure prediction is gaining popularity because it has the potential to prevent disk failures from occurring in the first place. Recent work has turned toward applying machine learning (ML) approaches based on disk SMART attributes to disk failure prediction. However, traditional ML approaches require a large set of training data to deliver good predictive performance. In large-scale storage systems, new disks are added gradually to expand storage capacity or to replace failed disks, so over time these systems come to include small numbers of new disks from different vendors and/or of different models from the same vendor. We refer to these relatively small populations of disks as minority disks. Lacking sufficient training data, traditional ML approaches fail to deliver satisfactory predictive performance in evolving storage systems that consist of heterogeneous minority disks. To address this challenge and improve predictive performance for minority disks in large data centers, we propose TLDFP, a minority disk failure prediction model based on a transfer learning approach. Our evaluation on two realistic datasets demonstrates that TLDFP delivers much more precise results than four popular prediction models based on traditional ML algorithms and two state-of-the-art transfer learning methods.
