Identifying bad software changes via multimodal anomaly detection for online service systems

In large-scale online service systems, software changes are inevitable and frequent. Because they introduce new code or configurations, changes are likely to cause incidents and degrade the user experience. It is therefore essential for engineers to identify bad software changes, so as to reduce the impact of incidents and improve system reliability. To better understand bad software changes, we perform the first empirical study based on large-scale real-world data from a large commercial bank. Our quantitative analyses indicate that about 50.4% of incidents are caused by bad changes, mainly because of code defects, configuration errors, resource contention, and software versions. In addition, our qualitative analyses show that the current practice for detecting bad software changes does not cope well with the heterogeneous multi-source data involved in software changes. Based on the findings and motivation obtained from the empirical study, we propose a novel approach named SCWarn, which aims to identify bad changes and produce interpretable alerts accurately and in a timely manner. The key idea of SCWarn is to use multimodal learning to identify anomalies from heterogeneous multi-source data. An extensive study on two datasets with various bad software changes demonstrates that our approach significantly outperforms all the compared approaches, achieving an F1-score of 0.95 on average and reducing MTTD (mean time to detect) by 20.4%∼60.7%. We also share some success stories and lessons learned from practical usage.
