Towards intelligent incident management: why we need it and how we make it

The management of cloud service incidents (unplanned interruptions or outages of a service or product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises can resolve most incidents automatically and in a timely manner. In practice, however, we still observe critical service incidents that occur unexpectedly and that orchestrated diagnosis workflows fail to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and to understand modern incident management systems, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate their underlying causes from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework for intelligent incident management, and show the practical benefits it brings to Microsoft's cloud services.
