System Failure Prediction through Rare-Events Elastic-Net Logistic Regression

Using logistic regression to predict failures in a distributed system from previous events is a standard approach in the literature. The technique is unreliable, however, in two situations: when predicting rare events, which occur too infrequently for the algorithm to capture, and in environments with too many variables, where logistic regression tends to overfit and manually selecting a subset of variables to build the model is error-prone. In this paper, we solve an industrial research case that presented these conditions by combining three elements: elastic net logistic regression, a method that automatically selects useful variables; a cross-validation process on top of it; and a rare-events prediction technique that reduces computation time. The resulting process provides two layers of cross-validation that automatically obtain the optimal model complexity and the optimal model parameter values, while ensuring that even rare events are correctly predicted from a small number of training instances. We tested this method against real industrial data, obtaining 60 out of 80 possible models with an average model accuracy of 90%.
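The combination described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the synthetic data, the function names (`make_population`, `downsample`, `fit`, `cross_validate`, `prior_correct`), the penalty grid, and the proximal gradient solver are all assumptions made for illustration. The sketch uses a single cross-validated grid search over the penalty strength `lam` and the L1/L2 mixing weight `alpha` rather than two layers of cross-validation, and it assumes that the rare-events technique is case-control downsampling followed by the intercept prior correction of King and Zeng (2001).

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink v toward zero by t.
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit(X, y, lam, alpha, lr=0.5, epochs=150):
    # Proximal gradient descent on
    #   mean log-loss + lam * (alpha * |w|_1 + (1 - alpha) / 2 * |w|_2^2)
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += err / n
            for j in range(p):
                gw[j] += err * xi[j] / n
        b -= lr * gb  # the intercept is not penalized
        for j in range(p):
            # Gradient step on the smooth part (loss + ridge), then L1 shrinkage.
            step = w[j] - lr * (gw[j] + lam * (1.0 - alpha) * w[j])
            w[j] = soft_threshold(step, lr * lam * alpha)
    return w, b


def log_loss(X, y, w, b):
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1.0 - p + eps)
    return total / len(y)


def cross_validate(X, y, grid, k=3):
    # Return the (lam, alpha) pair with the lowest mean held-out log-loss.
    best, best_loss = None, float("inf")
    for lam, alpha in grid:
        loss = 0.0
        for f in range(k):
            tr = [i for i in range(len(y)) if i % k != f]
            te = [i for i in range(len(y)) if i % k == f]
            w, b = fit([X[i] for i in tr], [y[i] for i in tr], lam, alpha)
            loss += log_loss([X[i] for i in te], [y[i] for i in te], w, b) / k
        if loss < best_loss:
            best, best_loss = (lam, alpha), loss
    return best


def downsample(X, y, rng):
    # Keep every rare event and an equally sized random sample of non-events.
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = rng.sample([i for i, yi in enumerate(y) if yi == 0], len(pos))
    idx = pos + neg
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def prior_correct(b, sample_rate, population_rate):
    # King and Zeng (2001) intercept correction for case-control sampling.
    odds = ((1.0 - population_rate) / population_rate) * (sample_rate / (1.0 - sample_rate))
    return b - math.log(odds)


def make_population(n, rng):
    # Synthetic stand-in for system event data: 6 features, only 2 informative.
    X = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(-6.0 + 2.0 * x[0] - 1.5 * x[1]) else 0 for x in X]
    return X, y


rng = random.Random(0)
X, y = make_population(4000, rng)
Xs, ys = downsample(X, y, rng)  # small, balanced training set
grid = [(lam, alpha) for lam in (0.01, 0.1) for alpha in (0.5, 1.0)]
lam, alpha = cross_validate(Xs, ys, grid)
w, b = fit(Xs, ys, lam, alpha)
b = prior_correct(b, sum(ys) / len(ys), sum(y) / len(y))
print("chosen (lam, alpha):", (lam, alpha))
print("weights:", [round(wj, 2) for wj in w], "corrected intercept:", round(b, 2))
```

Training on the balanced subsample keeps the fit cheap because most non-events are discarded, and the intercept correction then shifts the predicted probabilities back toward the event rate of the full population. The L1 part of the penalty is what drives the weights of uninformative variables toward zero, which stands in for manual variable selection.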
