Evaluation of Causal Inference Techniques for AIOps

Inferring the causality of events from log data is critical to IT operations teams, who continuously strive to identify probable root causes so that they can resolve incident tickets quickly and keep downtime and service interruptions to a minimum. Although prior work has applied specific causal inference techniques to proprietary log data, it does not benchmark the performance of different techniques on a common system or dataset. In this work, we evaluate the performance of multiple state-of-the-art causal inference techniques using log data obtained from a publicly available benchmark microservice system. We model log data both as a time series of error counts and as a temporal event sequence, and we evaluate three families of Granger causal techniques: regression-based methods, independence-testing-based methods, and event models. Our preliminary results indicate that event models yield causal graphs with higher precision and recall than regression-based and independence-testing-based Granger methods.
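To make the regression-based family concrete, the following is a minimal sketch of a pairwise Granger test on error-count time series. It is an illustration only, not the implementation evaluated in this work: the helper names (`lagged_design`, `granger_f_stat`), the lag order, and the synthetic counts for two hypothetical services are all assumptions.

```python
import numpy as np

def lagged_design(series, lags):
    """Columns are series[t-1], ..., series[t-lags] for t = lags .. T-1."""
    T = len(series)
    return np.column_stack([series[lags - k : T - k] for k in range(1, lags + 1)])

def residual_sum_of_squares(X, y):
    """Least-squares fit of y on X, returning the residual sum of squares."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((y - X @ coef) ** 2))

def granger_f_stat(cause, effect, lags=2):
    """F-statistic comparing a model of `effect` on its own lags
    against a model that also includes lags of `cause`."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    y = effect[lags:]
    intercept = np.ones((len(y), 1))
    restricted = np.hstack([intercept, lagged_design(effect, lags)])
    full = np.hstack([restricted, lagged_design(cause, lags)])
    rss_restricted = residual_sum_of_squares(restricted, y)
    rss_full = residual_sum_of_squares(full, y)
    dof = len(y) - full.shape[1]
    return ((rss_restricted - rss_full) / lags) / (rss_full / dof)

# Synthetic per-window error counts (assumed data): errors in service A
# raise the error rate of service B one window later.
rng = np.random.default_rng(0)
errors_a = rng.poisson(2.0, size=500)
errors_b = np.concatenate([[0], rng.poisson(0.5 + 0.8 * errors_a[:-1])])

f_a_to_b = granger_f_stat(errors_a, errors_b)
f_b_to_a = granger_f_stat(errors_b, errors_a)
print(f"F(A -> B) = {f_a_to_b:.1f}, F(B -> A) = {f_b_to_a:.1f}")
```

A large F-statistic in one direction and a small one in the other suggests a directed edge in the causal graph. A full analysis would also convert the statistic to a p-value, condition on the remaining services, and correct for multiple comparisons.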
