Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey

The momentum gained by microservices and cloud-native software architecture has pushed today's enterprise IT towards multi-service applications. The proliferation of services and service interactions within applications, which often consist of hundreds of interacting services, makes it harder to detect failures and to identify their possible root causes, even though both are crucial for promptly recovering and fixing applications. Various techniques have been proposed to promptly detect failures based on their symptoms, viz., by observing anomalous behaviour in one or more application services, and to analyse the logs or monitored performance of such services to determine the possible root causes of observed anomalies. The objective of this survey is to provide a structured overview and a qualitative analysis of currently available techniques for anomaly detection and root cause analysis in modern multi-service applications. Some open challenges and research directions stemming from the analysis are also discussed.
