Graph-based Incident Aggregation for Large-Scale Online Service Systems

As online service systems grow in complexity and volume, the way service incidents are managed significantly affects company revenue and user trust. Because failures cascade, a single cloud failure often triggers an overwhelming number of incidents from dependent services and devices. For efficient incident management, related incidents should be aggregated quickly to narrow down the problem scope. To this end, we propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, simultaneously encoding the topological and temporal correlations among incidents, so it can be readily used for online incident aggregation. In particular, to learn the correlations more accurately, we recover the complete scope of a failure's cascading impact by leveraging fine-grained system monitoring data, i.e., Key Performance Indicators (KPIs). The framework is evaluated on real-world incident data collected from a large-scale online service system of Huawei Cloud. The experimental results show that GRLIA is effective and outperforms existing methods. Furthermore, the framework has been successfully deployed in industrial practice.
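To make the online aggregation step concrete, the following is a minimal sketch of how learned incident-type vectors might be used to group incoming incidents. It is not the GRLIA implementation: the `Incident` class, the `aggregate` function, the cosine-similarity rule, the `sim_threshold` and `window` parameters, and the toy embeddings are all assumptions introduced for illustration. The abstract states only that one vector is learned per incident type and that the vectors are used for online aggregation.

```python
import math
from dataclasses import dataclass


@dataclass
class Incident:
    incident_type: str  # hypothetical type label, e.g. "disk_full"
    timestamp: float    # seconds since an arbitrary origin


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def aggregate(incidents, embeddings, sim_threshold=0.8, window=600.0):
    """Greedily group incidents that are close in embedding space and in time.

    Assumed rule (not taken from the paper): an incident joins the most
    similar existing group if that group's latest incident is at most
    `window` seconds old and the best cosine similarity to any member
    reaches `sim_threshold`; otherwise it starts a new group.
    """
    groups = []
    for inc in sorted(incidents, key=lambda i: i.timestamp):
        vec = embeddings[inc.incident_type]
        best_group, best_sim = None, sim_threshold
        for group in groups:
            # Skip groups whose most recent incident is too old.
            if inc.timestamp - group[-1].timestamp > window:
                continue
            sim = max(cosine(vec, embeddings[m.incident_type]) for m in group)
            if sim >= best_sim:
                best_group, best_sim = group, sim
        if best_group is None:
            groups.append([inc])
        else:
            best_group.append(inc)
    return groups
```

In this sketch the time window stands in for temporal correlation and the similarity threshold for the correlations encoded in the vectors; in practice both would need tuning against labeled incident groups.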
