Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms

Microsoft's internal big data analytics platform comprises hundreds of thousands of machines and serves over half a million jobs daily from thousands of users. The majority of these jobs are recurring and are crucial to the company's operation. Although administrators spend significant effort tuning system performance, some jobs inevitably experience slowdowns, i.e., their execution time degrades compared to previous runs. Currently, investigating such slowdowns is a labor-intensive and error-prone process that costs Microsoft significant human and machine resources and negatively impacts several lines of business. In this work, we present Griffon, a system we built and have deployed in our production analytics clusters since last year to automatically discover the root cause of job slowdowns. Most existing solutions rely on labeled data (i.e., resolved incidents with labeled reasons for job slowdowns), which in most practical scenarios is non-existent or non-trivial to acquire. Others rely on time-series analysis of individual metrics and do not target specific jobs holistically. In contrast, Griffon casts the problem as a regression problem that predicts the runtime of a job, and we show how the relative contributions of the features used to train our interpretable model can be exploited to rank the potential causes of job slowdowns. Evaluated over historical incidents, we show that Griffon discovers slowdown causes that are consistent with those validated by domain-expert engineers, in a fraction of the time the manual investigation requires.
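To make the feature-contribution idea concrete, the sketch below trains an interpretable random-forest runtime model and decomposes the predicted runtime of a slow job run into a baseline plus per-feature contributions via decision-path (Saabas-style) attribution, then ranks features by how much they inflate the prediction. This is a minimal illustration only: the scikit-learn implementation, the synthetic data, and feature names such as input_size_gb or container_count are assumptions made for this example, not Griffon's actual feature set or code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical job features; Griffon's real feature set is not shown here.
FEATURES = ["input_size_gb", "container_count", "queue_wait_min", "cluster_load"]

def feature_contributions(forest, x):
    """Decompose a single prediction into a baseline plus per-feature contributions
    by walking each tree's root-to-leaf decision path (Saabas-style attribution)."""
    n_features = x.shape[1]
    contrib = np.zeros(n_features)
    baseline = 0.0
    for est in forest.estimators_:
        tree = est.tree_
        # Node ids on the path taken by this sample (ids increase with depth).
        path = est.decision_path(x).indices
        node_mean = tree.value[:, 0, 0]        # mean runtime at each node
        baseline += node_mean[path[0]]         # root value = tree's average runtime
        for parent, child in zip(path[:-1], path[1:]):
            # Credit the change in predicted runtime at this split to the split feature.
            contrib[tree.feature[parent]] += node_mean[child] - node_mean[parent]
    n_trees = len(forest.estimators_)
    return baseline / n_trees, contrib / n_trees

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(500, len(FEATURES)))          # synthetic historical runs
    y = 30 * X[:, 0] + 10 * X[:, 3] + rng.normal(0, 1, 500)   # synthetic runtimes (minutes)

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    slow_run = np.array([[0.9, 0.5, 0.2, 0.95]])              # a hypothetical slow run
    baseline, contrib = feature_contributions(model, slow_run)
    print(f"baseline: {baseline:.1f} min, predicted: {baseline + contrib.sum():.1f} min")
    # Rank candidate slowdown causes: features that push the prediction up the most.
    for name, c in sorted(zip(FEATURES, contrib), key=lambda t: -t[1]):
        print(f"{name:>16s}: {c:+.1f} min")
```

One way such contributions can be used in practice is to compute the same decomposition for a healthy baseline run of the same job and rank features by the change in their contributions between the two runs.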
