Identifying linked incidents in large-scale online service systems

In large-scale online service systems, incidents occur frequently and have a variety of causes, from software and hardware updates to changes in the operating environment. These incidents can significantly degrade system availability and customer satisfaction. Some incidents are linked because they are duplicates or are otherwise inter-related. Knowing which incidents are linked can greatly help on-call engineers find mitigation solutions and identify root causes. In this work, we investigate incidents and their links in a representative real-world incident management (IcM) system. Based on the indicators of linked incidents identified in this investigation, we propose LiDAR (Linked Incident identification with DAta-driven Representation), a deep-learning-based approach to incident linking. More specifically, LiDAR incorporates the textual descriptions of incidents and structural information extracted from historical linked incidents to identify possible links among a large number of incidents. To evaluate its effectiveness, we apply LiDAR to a real-world IcM system and find that it outperforms other state-of-the-art methods.
