Desh: deep learning for system health prediction of lead times to failure in HPC

Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are likely to experience even higher fault rates due to increased component count and density. Triggering resilience-mitigating techniques remains a challenge due to the absence of well defined failure indicators. System logs consist of unstructured text that obscures essential system health information contained within. In this context, efficient failure prediction via log mining can enable proactive recovery mechanisms to increase reliability. This work aims to predict node failures that occur in supercomputing systems via long short-term memory (LSTM) networks that exploit recurrent neural networks (RNNs). Our framework, Desh1 (Deep Learning for System Health), diagnoses and predicts failures with short lead times. Desh identifies failure indicators with enhanced training and classification for generic applicability to logs from operating systems and software components without the need to modify any of them. Desh uses a novel three-phase deep learning approach to (1) train to recognize chains of log events leading to a failure, (2) re-train chain recognition of events augmented with expected lead times to failure, and (3) predict lead times during testing/inference deployment to predict which specific node fails in how many minutes. Desh obtains as high as 3 minutes average lead time with no less than 85% recall and 83% accuracy to take proactive actions on the failing nodes, which could be used to migrate computation to healthy nodes.

[1]  Scott Klasky,et al.  Predicting Output Performance of a Petascale Supercomputer , 2017, HPDC.

[2]  Rajeev Thakur,et al.  A study of dynamic meta-learning for failure prediction in large-scale systems , 2010, J. Parallel Distributed Comput..

[3]  Hongbing Wang,et al.  Online Reliability Prediction via Long Short Term Memory for Service-Oriented Systems , 2017, 2017 IEEE International Conference on Web Services (ICWS).

[4]  Michael I. Jordan,et al.  Detecting large-scale system problems by mining console logs , 2009, SOSP '09.

[5]  Zhiling Lan,et al.  3-Dimensional root cause diagnosis via co-analysis , 2012, ICAC '12.

[6]  Alexander Aiken,et al.  Alert Detection in System Logs , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[7]  Xiao Yu,et al.  CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs , 2016, ASPLOS.

[8]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[9]  Jian Li,et al.  An Evaluation Study on Log Parsing and Its Use in Log Mining , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).

[10]  Tao Wang,et al.  Deep learning with COTS HPC systems , 2013, ICML.

[11]  Chung-Hsien Wu,et al.  Mood detection from daily conversational speech using denoising autoencoder and LSTM , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Franck Cappello,et al.  Failure prediction for HPC systems and applications , 2013, Int. J. High Perform. Comput. Appl..

[13]  Franck Cappello,et al.  Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[14]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[15]  Saurabh Gupta,et al.  Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[16]  Christian Engelmann,et al.  Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[17]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Feifei Li,et al.  DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning , 2017, CCS.

[19]  Franck Cappello,et al.  Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[20]  Trishul M. Chilimbi,et al.  Project Adam: Building an Efficient and Scalable Deep Learning Training System , 2014, OSDI.

[21]  Domenico Cotroneo,et al.  Assessing time coalescence techniques for the analysis of supercomputer logs , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[22]  Osman S. Unsal,et al.  Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[23]  John Liagouris,et al.  Online Reconstruction of Structural Information from Datacenter Logs , 2017, EuroSys.

[24]  Bin Nie,et al.  Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities , 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).

[25]  Pavel Tariqul Islam,et al.  Predicting Application Failure in Cloud: A Machine Learning Approach , 2017, 2017 IEEE International Conference on Cognitive Computing (ICCC).

[26]  Catello Di Martino One Size Does Not Fit All: Clustering Supercomputer Failures Using a Multiple Time Window Approach , 2013, ISC.

[27]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[28]  Chao Wang,et al.  Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[29]  M. Snir,et al.  Navigating the Blue Waters : Online Failure Prediction in the Petascale Era , 2013 .

[30]  Xiaohui Gu,et al.  UBL: unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems , 2012, ICAC '12.

[31]  Zhiling Lan,et al.  Toward Automated Anomaly Identification in Large-Scale Systems , 2010, IEEE Transactions on Parallel and Distributed Systems.

[32]  Domenico Cotroneo,et al.  Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).

[33]  Tong Zhang,et al.  Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings , 2016, ICML.

[34]  Sally A. McKee,et al.  Digging deeper into cluster system logs for failure prediction and root cause diagnosis , 2014, 2014 IEEE International Conference on Cluster Computing (CLUSTER).

[35]  Narayan Desai,et al.  Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[36]  Yu Luo,et al.  Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle , 2016, OSDI.

[37]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[38]  Ke Zhang,et al.  2016 Ieee International Conference on Big Data (big Data) Automated It System Failure Prediction: a Deep Learning Approach , 2022 .

[39]  Lars Grunske,et al.  Hora: Architecture-aware online failure prediction , 2017, J. Syst. Softw..

[40]  Saurabh Gupta,et al.  Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[41]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[42]  Song Fu,et al.  Relational Synthesis of Text and Numeric Data for Anomaly Detection on Computing System Logs , 2016, 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA).

[43]  Glenn A. Fink,et al.  Predicting Computer System Failures Using Support Vector Machines , 2008, WASL.