Automated IT system failure prediction: A deep learning approach

In mission-critical IT services, system failure prediction is increasingly important: it prevents unexpected system downtime and assures service reliability for end users. While operational console logs record rich and descriptive information on the health status of those IT systems, existing system management technologies mostly use them in a labor-intensive forensics approach, i.e., identifying what went wrong after the fact. Recent efforts on log-based system management take an automation approach with text mining techniques, such as term frequency-inverse document frequency (TF-IDF). However, those techniques lead to a high-dimensional feature space and are not easily generalizable to heterogeneous log formats. In this paper, we present a novel system that automatically parses streamed console logs and detects early warning signals for IT system failure prediction. In particular, our solution includes a log pattern extraction method that clusters together logs with similar format and content. We then adapt the TF-IDF idea by treating each pattern as a word and the set of patterns in each discretized epoch as a document. This leads to a feature space with significantly lower dimensionality that can provide robust signals for the status of the system. As system failures tend to occur very rarely, we apply a recurrent neural network, namely Long Short-Term Memory (LSTM), to deal with the “rarity” of labeled data in the training process. LSTM is able to capture long-range dependencies across sequences, and therefore outperforms traditional supervised learning methods in our application domain. We evaluated our proposed technology and compared it with state-of-the-art machine learning approaches using real log traces from two large enterprise systems. The results showed the advantages and potential of our system in predicting complex IT failures. To our knowledge, our work is the first that employs LSTM for log-based system failure prediction.
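The pattern-as-word idea in the abstract can be illustrated with a minimal sketch. It assumes the logs have already been clustered into pattern IDs and grouped into fixed-length epochs; the function name `pattern_tfidf`, the pattern IDs, and the particular TF and IDF formulas (relative frequency and unsmoothed logarithm) are illustrative assumptions, not the weighting the paper necessarily uses.

```python
import math
from collections import Counter


def pattern_tfidf(epochs):
    """Compute TF-IDF weights with log patterns as words and epochs as documents.

    epochs: list of lists of pattern IDs, one inner list per time epoch.
    Returns one {pattern_id: weight} dict per epoch.
    (Hypothetical helper; the weighting scheme is an assumption.)
    """
    n_epochs = len(epochs)

    # Document frequency: number of epochs in which each pattern appears.
    df = Counter()
    for patterns in epochs:
        df.update(set(patterns))

    features = []
    for patterns in epochs:
        counts = Counter(patterns)
        total = len(patterns)
        weights = {}
        for pid, count in counts.items():
            tf = count / total                  # relative frequency within the epoch
            idf = math.log(n_epochs / df[pid])  # rarer patterns get larger weight
            weights[pid] = tf * idf
        features.append(weights)
    return features


# Toy input: three epochs, pattern IDs are made up for illustration.
epochs = [
    ["P1", "P1", "P2"],
    ["P1", "P3"],
    ["P1", "P2", "P2", "P2"],
]
features = pattern_tfidf(epochs)
```

In this toy input, a pattern that occurs in every epoch (`P1`) receives weight zero, while a pattern seen in only one epoch (`P3`) is weighted most heavily. The feature dimension equals the number of distinct patterns rather than the number of distinct tokens, which is the dimensionality reduction the abstract refers to.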
