PREVENT: An Unsupervised Approach to Predict Software Failures in Production

This paper presents PREVENT, an approach for predicting and localizing failures in distributed enterprise applications by combining unsupervised techniques. Software failures can have dramatic consequences in production, so predicting and localizing failures is an essential step for activating healing measures that limit their disruptive consequences. In the state of the art, many failures can be predicted from anomalous combinations of system metrics, judged against either rules provided by domain experts or supervised learning models. However, both approaches limit the effectiveness of current techniques to well-understood types of failures that can be either captured with predefined rules or observed while training supervised models. PREVENT integrates the core ingredients of unsupervised approaches into a novel approach that predicts failures and localizes failing resources without requiring either predefined rules or training with observed failures. The results of experimenting with PREVENT on a commercially-compliant distributed cloud system indicate that PREVENT provides more stable and reliable predictions, earlier than or comparably to supervised learning approaches, without requiring long and often impractical training with failures.
