Lifelong Disk Failure Prediction via GAN-Based Anomaly Detection

Disk failure prediction, a classical technique in storage systems, aims to predict impending disk failures in advance to ensure high data reliability. Over the past decades, many supervised machine learning algorithms that take SMART (Self-Monitoring, Analysis and Reporting Technology) attributes as input have proven effective for disk failure prediction. However, these approaches rely heavily on a substantial amount of annotated failed-disk data, which is extremely imbalanced: failed disks are far fewer than healthy ones. The result is suboptimal performance, and the models may even be unusable at the beginning of their deployment, i.e., the cold-start problem. Inspired by the significant success of GAN (Generative Adversarial Network) based anomaly detection, in this paper we recast disk failure prediction as an anomaly detection problem. Specifically, we develop a novel Semi-supervised method for lifelong disk failure Prediction via Adversarial training, called SPA. Unlike existing supervised approaches, SPA is trained only on healthy disks, which avoids the traditional limitations of dataset imbalance and eliminates the cold-start problem. Furthermore, we propose a novel 2D image-like representation technique that enables the use of deep learning and automatic feature extraction. Experimental results on real-world SMART datasets demonstrate that, compared with state-of-the-art supervised machine learning methods, our approach predicts disk failures with higher accuracy over the entire lifetime of the models, i.e., during both the initial period and long-term use.
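The healthy-only training idea can be illustrated with a minimal sketch. It is not the SPA implementation: the abstract does not specify the layout of the 2D representation, so the sketch assumes that each sample stacks a sliding window of consecutive daily SMART records into a (days x attributes) matrix, and it swaps the GAN for a much simpler PCA reconstruction model. The function names (to_windows, fit_healthy, anomaly_score) and all parameters are hypothetical. The only point shown is that a model fitted on healthy disks alone can flag a degrading disk through a large reconstruction error.

```python
import numpy as np


def to_windows(smart, window):
    """Stack `window` consecutive daily SMART records into 2D samples.

    smart: array of shape (days, attributes) for one disk.
    Returns an array of shape (days - window + 1, window, attributes).
    The (days x attributes) layout is an assumption, not the paper's.
    """
    smart = np.asarray(smart, dtype=float)
    days = smart.shape[0]
    if days < window:
        raise ValueError("not enough daily records for one window")
    return np.stack([smart[i:i + window] for i in range(days - window + 1)])


def fit_healthy(windows, n_components):
    """Fit a PCA reconstruction model on healthy windows only.

    PCA is a simple stand-in for the GAN described in the abstract.
    """
    flat = windows.reshape(len(windows), -1)
    mean = flat.mean(axis=0)
    # Rows of vt are the principal directions of the healthy data.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]


def anomaly_score(windows, model):
    """Return the reconstruction error of each window under the model."""
    mean, basis = model
    flat = windows.reshape(len(windows), -1) - mean
    recon = flat @ basis.T @ basis
    return np.linalg.norm(flat - recon, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data: 4 SMART-like attributes fluctuating around 100.
    healthy = 100 + rng.normal(0, 1, (60, 4))
    model = fit_healthy(to_windows(healthy, 5), n_components=3)
    # Threshold taken from healthy data only; no failed disks are needed.
    threshold = anomaly_score(to_windows(healthy, 5), model).max()

    # A synthetic degrading disk: attribute 0 drifts over the last 10 days.
    failing = 100 + rng.normal(0, 1, (30, 4))
    failing[-10:, 0] -= np.arange(1, 11) * 20
    scores = anomaly_score(to_windows(failing, 5), model)
    print("flagged windows:", int((scores > threshold).sum()))
```

Because the threshold is derived from healthy windows alone, the sketch needs no labeled failures, which is the property the abstract credits with avoiding class imbalance and the cold-start problem.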
