Making Disk Failure Predictions SMARTer!

Disk drives are among the most commonly replaced hardware components, and predicting their failures accurately remains a challenge. In this work, we present analysis and findings from one of the largest disk failure prediction studies, covering 380,000 hard drives over a period of two months across 64 sites of a large, leading data center operator. Our proposed machine-learning-based models predict disk failures with, on average, 0.95 F-measure and 0.95 Matthews correlation coefficient (MCC) for a 10-day prediction horizon.
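For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how the F-measure and MCC are computed from a binary confusion matrix, using their standard definitions. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; it uses all four confusion-matrix cells."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical, heavily imbalanced example: 100 failed drives among 10,000.
tp, fp, fn, tn = 95, 5, 5, 9895
print(f"F-measure: {f_measure(tp, fp, fn):.4f}")
print(f"MCC:       {mcc(tp, tn, fp, fn):.4f}")
```

Unlike the F-measure, MCC also accounts for true negatives, which is one reason it is often reported alongside the F-measure on imbalanced data such as disk failure logs, where healthy drives vastly outnumber failed ones.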
