General Feature Selection for Failure Prediction in Large-scale SSD Deployment

Solid-state drive (SSD) failures are likely to cause system-level failures and downtime, which makes SSD failure prediction critical to large-scale SSD deployment. Existing SSD failure prediction studies are mostly based on customized SSDs with proprietary monitoring metrics, so their results are difficult to reproduce. To support general SSD failure prediction across different drive models and vendors, this paper proposes Wear-out-updating Ensemble Feature Ranking (WEFR), which selects SMART attributes as learning features in an automated and robust manner. WEFR combines the results of different feature ranking methods and automatically generates the final feature selection based on complexity measures and change point detection of wear-out degrees. We evaluate our approach using a dataset of nearly 500K working SSDs at Alibaba. Our results show that the proposed approach is effective and outperforms related approaches. We have successfully applied the proposed approach to improve the reliability of cloud storage systems in production SSD-based data centers. We release our dataset for public use.
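
The idea of combining several feature rankings can be illustrated with a minimal sketch. It is not the WEFR algorithm itself: it assumes a plain mean-rank aggregation and a fixed cutoff `k`, whereas the abstract describes a cutoff derived automatically from complexity measures and wear-out change points. The attribute names and scores below are hypothetical and exist only for illustration.

```python
from statistics import mean


def rank_features(scores):
    """Turn a feature -> importance score table into feature -> rank (1 = best)."""
    ordered = sorted(scores, key=lambda f: scores[f], reverse=True)
    return {feature: position + 1 for position, feature in enumerate(ordered)}


def ensemble_ranking(score_tables):
    """Order features by their mean rank across several rankers.

    Assumption: every table scores the same set of features.
    """
    ranks = [rank_features(table) for table in score_tables]
    features = list(ranks[0])
    mean_rank = {f: mean(r[f] for r in ranks) for f in features}
    return sorted(features, key=lambda f: mean_rank[f])


# Hypothetical importance scores from three different rankers
# (for example a correlation test, a tree ensemble, and a boosted model).
scores_a = {"wear_leveling": 0.9, "realloc_sectors": 0.6,
            "power_on_hours": 0.2, "temperature": 0.1}
scores_b = {"wear_leveling": 0.5, "realloc_sectors": 0.8,
            "power_on_hours": 0.3, "temperature": 0.05}
scores_c = {"wear_leveling": 0.7, "realloc_sectors": 0.4,
            "power_on_hours": 0.5, "temperature": 0.1}

ranking = ensemble_ranking([scores_a, scores_b, scores_c])

# Fixed cutoff, chosen by hand here; an automated threshold would replace it.
k = 2
top_k = ranking[:k]
print(ranking)
print(top_k)
```

Aggregating ranks rather than raw scores keeps rankers with different score scales comparable, which is one common reason to build an ensemble this way.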
