Experience report on applying software analytics in incident management of online service

As online services grow in popularity, incident management has become a critical task that aims to minimize service downtime and ensure the high quality of the provided services. In practice, incident management is conducted by analyzing a huge amount of monitoring data collected while a service is running. Such data-driven incident management faces several significant challenges, such as large data scale, a complex problem space, and incomplete knowledge. To address these challenges, we carried out two years of software-analytics research, in which we designed a set of novel data-driven techniques and developed an industrial system called the Service Analysis Studio (SAS), targeting real scenarios in a large-scale online service of Microsoft. SAS has been deployed to product datacenters worldwide and is widely used by on-call engineers for incident management. This paper shares our experience of using software analytics to solve engineers' pain points in incident management, the data-analysis techniques we developed, and the lessons learned from the process of research development and technology transfer.
