Failure prediction based on log files using Random Indexing and Support Vector Machines

Research problem: Failures can have a substantial impact on software systems because recovery may require unexpected amounts of time and resources. Accurate failure predictions help mitigate that impact: resources, applications, and services can be scheduled around predicted failures. However, providing accurate predictions sufficiently far in advance is challenging. Log files contain messages that represent changes of system state, and a sequence or pattern of messages may be used to predict failures.

Contribution: We describe an approach to predicting failures from log files using Random Indexing (RI) and Support Vector Machines (SVMs).

Method: RI is applied to represent sequences: each operation is characterized in terms of its context. SVMs assign each sequence to one of two classes, failure or non-failure. Weighted SVMs are applied to deal with imbalanced datasets and to improve the true positive rate. We apply our approach to log files collected during approximately three months of work in a large European manufacturing company.

Results: According to our results, weighted SVMs sacrifice some specificity to improve sensitivity. Specificity remains higher than 0.80 in four out of six analyzed applications.

Conclusions: Overall, our approach predicts both failures and non-failures reliably.
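As an illustration of the method, the following is a minimal Python sketch of how Random Indexing might turn sequences of logged operations into fixed-length vectors, and how sensitivity and specificity are computed from predictions. It is not the authors' implementation: the function names, the operation names in the example, the vector dimension, the number of non-zero entries, the window size, and the choice to sum context vectors into a sequence vector are all assumptions made for the example.

```python
import random


def index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector: a few +1/-1 entries, all others 0."""
    vec = [0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzeros)):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def random_indexing(sequences, dim=64, nonzeros=4, window=2, seed=0):
    """Build index and context vectors for every operation.

    The context vector of an operation is the sum of the index vectors
    of the operations that occur within `window` positions of it.
    All parameter defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    ops = sorted({op for seq in sequences for op in seq})
    index = {op: index_vector(dim, nonzeros, rng) for op in ops}
    context = {op: [0] * dim for op in ops}
    for seq in sequences:
        for i, op in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                neighbour = index[seq[j]]
                ctx = context[op]
                for k in range(dim):
                    ctx[k] += neighbour[k]
    return index, context


def sequence_vector(seq, context, dim=64):
    """Represent a sequence as the sum of its operations' context vectors
    (one simple aggregation choice, assumed for this sketch)."""
    vec = [0] * dim
    for op in seq:
        for k, value in enumerate(context[op]):
            vec[k] += value
    return vec


def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate); 1 marks a failure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


# Hypothetical log sequences; the operation names are made up.
logs = [["login", "query", "commit"], ["login", "query", "error"]]
index, context = random_indexing(logs)
features = [sequence_vector(seq, context) for seq in logs]
```

The resulting feature vectors could then be passed to an SVM classifier. For the weighted variant, a library such as scikit-learn exposes a `class_weight` parameter on `sklearn.svm.SVC` that penalizes errors on the rare failure class more heavily, which is one way to trade specificity for sensitivity.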
