A Data-driven Prognostic Architecture for Online Monitoring of Hard Disks Using Deep LSTM Networks

With the advent of pervasive cloud computing technologies, service reliability and availability are becoming major concerns, especially as we start to integrate cyber-physical systems with cloud networks. A number of smart and connected community systems, such as emergency response systems, utilize cloud networks to analyze real-time data streams and provide context-sensitive decision support. Improving overall system reliability requires us to study all aspects of this end-to-end distributed system, including the backend data servers. In this paper, we describe a bi-layered prognostic architecture for predicting the Remaining Useful Life (RUL) of components of backend servers, especially those that are subject to degradation. We show that our architecture is especially good at predicting the RUL of hard disks. A Deep LSTM Network is used as the backbone of this fast, data-driven decision framework and dynamically captures the pattern of the incoming data. In the article, we discuss the architecture of the neural network and describe the mechanisms used to choose the various hyper-parameters. We describe the challenges faced in extracting effective training sets from highly unorganized and class-imbalanced big data, and establish methods for online prediction with extensive data pre-processing, feature extraction, and validation through test sets with unknown remaining useful lives of the hard disks. Our algorithm performs especially well in predicting RUL near the critical zone of a device approaching failure. The proposed architecture is able to predict whether a disk is going to fail in the next ten days with an average precision of 0.8435. In the future, we will extend this architecture to learn and predict the RUL of the edge devices in the end-to-end distributed systems of smart communities, taking into consideration context-sensitive external features such as weather.
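
As an illustration of the ten-day failure criterion, the following minimal sketch shows one way predicted RUL values could be turned into a "fails within ten days" flag and scored by precision. It is not the paper's evaluation code: the function names, the inclusive threshold (RUL <= 10 days), and the toy RUL values are all assumptions made for this example.

```python
from typing import Sequence


def fails_within(rul_days: Sequence[float], horizon_days: float = 10.0) -> list[bool]:
    """Flag each disk whose RUL is at most `horizon_days` (inclusive; an assumption)."""
    return [r <= horizon_days for r in rul_days]


def precision(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy values in days, invented for illustration only.
true_rul = [3, 25, 8, 40, 12, 1]
pred_rul = [5, 9, 7, 35, 30, 2]

pred_flags = fails_within(pred_rul)
true_flags = fails_within(true_rul)
score = precision(pred_flags, true_flags)
print(f"precision at 10-day horizon: {score:.2f}")
```

In this toy data, four disks are flagged as failing within ten days and three of them actually do, so one false alarm lowers the precision. An average precision such as the one reported above would typically be obtained by averaging a score of this kind over several test sets or runs; how the averaging is done here is not specified in the abstract.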
