NTAM: Neighborhood-Temporal Attention Model for Disk Failure Prediction in Cloud Platforms

With the rapid deployment of cloud platforms, high service reliability is of critical importance. An industrial cloud platform contains a huge number of disks, and disk failure is a common cause of service unreliability. In recent years, many machine learning based approaches to disk failure prediction have been proposed; they predict disk failures from disk status data before the failures actually happen, so that proactive actions can be taken in advance to improve service reliability. However, existing approaches treat each disk individually and do not explore the influence of neighboring disks. In this paper, we propose the Neighborhood-Temporal Attention Model (NTAM), a novel deep learning based approach to disk failure prediction. When predicting whether a disk will fail in the near future, NTAM utilizes not only the disk’s own status data but also its neighbors’ status data. Moreover, NTAM includes a novel attention-based temporal component to capture the temporal nature of the disk status data. In addition, we propose a data enhancement method, called Temporal Progressive Sampling (TPS), to handle the extreme data imbalance issue. We evaluate NTAM on a public dataset as well as two industrial datasets collected from millions of disks in Microsoft Azure. Our experimental results show that NTAM significantly outperforms state-of-the-art competitors. Our empirical evaluations also indicate the effectiveness of the neighborhood-aware component and the temporal component underlying NTAM, as well as the effectiveness of TPS. More encouragingly, we have successfully applied NTAM and TPS to Microsoft cloud platforms (including Microsoft Azure and Microsoft 365) and obtained benefits in industrial practice.
