ADF2T: an Active Disk Failure Forecasting and Tolerance Software

Hard disk failures inevitably affect the reliability of distributed file systems. This paper proposes ADF2T, an active disk failure forecasting and tolerance software. First, multiple SMART records within a time window are merged into one sample, and sliding the window increases the number of positive samples by tens of times. Second, features are selected by a two-stage sorting method, so that only the most useful features are used in machine learning modeling and model training time is noticeably shortened. Third, two-stage verification allows the parameters of an unreasonable proactive reconstruction strategy to be adjusted in time. In experiments, modeling and forecasting on the ZTE data set and the Backblaze data set yield recall rates of 95.66% and 84.28% and error rates of 0.23% and 2.45%, respectively. The work has been in commercial use in a ZTE data center for more than one year, where it has significantly improved the reliability of the distributed file system software.
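
The sliding-window step can be illustrated with a minimal sketch. The abstract does not say how the records in a window are merged, so concatenation is assumed here; the function name, the `window` and `step` parameters, and the sample readings are all hypothetical and are not taken from ADF2T itself.

```python
from typing import List, Sequence


def sliding_window_samples(
    records: Sequence[Sequence[float]], window: int, step: int = 1
) -> List[List[float]]:
    """Merge every `window` consecutive records into one flat sample.

    Assumption: merging means concatenating the per-record feature
    vectors in time order. The paper may merge them differently.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    samples = []
    # Slide the window over the time-ordered records of a single disk.
    for start in range(0, len(records) - window + 1, step):
        merged: List[float] = []
        for record in records[start:start + window]:
            merged.extend(record)
        samples.append(merged)
    return samples


# Hypothetical daily SMART readings for one failed disk:
# [reallocated sector count, temperature]
failed_disk = [[0, 35], [2, 36], [5, 38], [9, 41], [14, 43]]
positives = sliding_window_samples(failed_disk, window=3)
print(len(positives))   # 3
print(positives[0])     # [0, 35, 2, 36, 5, 38]
```

With five records and a window of three, one failed disk contributes three positive samples rather than one. Longer record histories give proportionally more samples, which is how a sliding window can enlarge a small positive class.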
