Spatio-temporal AI inference engine for estimating hard disk reliability

Abstract This paper focuses on building a spatio-temporal AI inference engine for estimating hard disk reliability. Most electronic systems, such as hard disks, routinely collect reliability parameters in the field to monitor the health of the system. Changes in these parameters over time are monitored, and any observed changes are compared with known failure signatures. If the trajectory of the measured data matches that of a failure signature, operators are alerted to take corrective action. However, operators are interested in identifying failures before they occur. The state-of-the-art methodology, including our prior work, is to train machine learning models on temporal sequence data that captures the variations across multiple features and to use these models to predict the remaining useful life of the devices. However, as we show in this paper, temporal prediction capability alone is not sufficient: it can lead to low precision, and the uncertainty around the prediction is very large. This is primarily due to the non-uniform progression of feature patterns over time. Our hypothesis is that accuracy can be improved by combining the temporal prediction methods with a spatial analysis that compares the values of key SMART features across devices of a similar model in a fixed time window (unlike the temporal method, which uses the data from a single device and a much larger historical window). In this paper, we first describe both the temporal and spatial approaches, describe the methods used to select various hyperparameters, and then show a workflow that combines the two methodologies and provide comparative results. Our results illustrate that the average precision of temporal methods using long short-term memory (LSTM) networks to predict impending failures in the next ten days was 84 percent. To improve precision, we take the set of disks identified as potential failures and apply spatial anomaly detection methods to those disks. This helps us remove false positives from the temporal prediction results and provides a tighter bound on the set of disks with impending failure.
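
The two-stage workflow described in the abstract can be illustrated with a minimal sketch. The abstract does not say which spatial anomaly detector is used, so the robust z-score (median and median absolute deviation) below is an assumed stand-in, and the function name `spatial_filter`, the threshold of 3.5, and the toy data are all hypothetical. The temporal stage (an LSTM in the paper) is represented only by its output, a boolean flag per disk.

```python
import numpy as np

def spatial_filter(window_means, temporal_candidates, threshold=3.5):
    """Keep only temporal candidates that are also spatial outliers.

    window_means: (n_disks, n_features) array of SMART feature values,
        one row per disk of the same model, summarized over one fixed
        time window (assumed input layout).
    temporal_candidates: (n_disks,) boolean flags from the temporal
        model (assumed to come from an upstream LSTM predictor).
    threshold: robust z-score cutoff (assumed value, not from the paper).
    """
    x = np.asarray(window_means, dtype=float)
    flagged = np.asarray(temporal_candidates, dtype=bool)

    # Per-feature fleet median and median absolute deviation (MAD).
    median = np.median(x, axis=0)
    mad = np.median(np.abs(x - median), axis=0)
    mad = np.where(mad == 0, 1e-9, mad)  # avoid division by zero

    # Robust z-score of every disk against its peers, per feature.
    robust_z = 0.6745 * np.abs(x - median) / mad

    # A disk is a spatial outlier if any feature exceeds the cutoff.
    is_outlier = (robust_z > threshold).any(axis=1)
    return flagged & is_outlier

# Toy fleet: 6 disks, 2 hypothetical SMART features per disk.
fleet = [[0, 30], [1, 31], [0, 29], [1, 30], [40, 31], [0, 30]]
# The temporal stage flagged disks 1 and 4 as potential failures.
candidates = [False, True, False, False, True, False]

confirmed = spatial_filter(fleet, candidates)
print(np.flatnonzero(confirmed).tolist())  # prints [4]
```

In this toy fleet, disk 1 is flagged by the temporal stage but looks like its peers within the window, so the spatial stage discards it as a false positive. Disk 4 deviates strongly from the fleet on the first feature and stays in the set.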
