Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Cloud-Scale Infrastructure

Modern cloud systems have a vast number of components that continuously undergo updates. Deploying these frequent updates quickly without breaking the system is challenging. In this paper, we present Gandalf, an end-to-end analytics service for safe deployment in a large-scale system infrastructure. Gandalf enables rapid and robust impact assessment of software rollouts to catch bad rollouts before they cause widespread outages. Gandalf monitors and analyzes various fault signals. It will correlate each signal against all the ongoing rollouts using a spatial and temporal correlation algorithm. The core decision logic of Gandalf includes an ensemble ranking algorithm that determines which rollout may have caused the fault signals, and a binary classifier that assesses the impact of the fault signals. The analysis result will decide whether a rollout is safe to proceed or should be stopped. By using a lambda architecture, Gandalf provides both realtime and long-term deployment monitoring with automated decisions and notifications. Gandalf has been running in production in Microsoft Azure for more than 18 months, serving both data-plane and control-plane components. It achieves 92.4% precision and 100% recall (no high-impact service outages in Azure Compute were caused by bad rollouts) for dataplane rollouts. For control-plane rollouts, Gandalf achieves 94.9% precision and 99.8% recall.

[1]  Armando Fox,et al.  Capturing, indexing, clustering, and retrieving system history , 2005, SOSP '05.

[2]  K. Pearson VII. Note on regression and inheritance in the case of two parents , 1895, Proceedings of the Royal Society of London.

[3]  Dennis Shasha,et al.  StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time , 2002, VLDB.

[4]  Wei Lin,et al.  StreamScope: Continuous Reliable Distributed Processing of Big Data Streams , 2016, NSDI.

[5]  Dongmei Zhang,et al.  Identifying impactful service system problems via log analysis , 2018, ESEC/SIGSOFT FSE.

[6]  Qiang Fu,et al.  Mining program workflow from interleaved traces , 2010, KDD.

[7]  Peng Huang,et al.  Comprehensive and Efficient Runtime Checking in System Software through Watchdogs , 2019, HotOS.

[8]  Ali Ghodsi,et al.  Drizzle: Fast and Adaptable Stream Processing at Scale , 2017, SOSP.

[9]  Peng Huang,et al.  Gray Failure: The Achilles' Heel of Cloud-Scale Systems , 2017, HotOS.

[10]  Haryadi S. Gunawi,et al.  Why Does the Cloud Stop Computing?: Lessons from Hundreds of Service Outages , 2016, SoCC.

[11]  Qiang Fu,et al.  Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[12]  D. Lawley A GENERALIZATION OF FISHER'S z TEST , 1938 .

[13]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[14]  Qiang Fu,et al.  Correlating events with time series for incident diagnosis , 2014, KDD.

[15]  Theo Schlossnagle Monitoring in a DevOps world , 2018, CACM.

[16]  Saeed Amizadeh,et al.  Generic and Scalable Framework for Automated Time-series Anomaly Detection , 2015, KDD.

[17]  Wilhelm Hasselbring,et al.  Including Performance Benchmarks into Continuous Integration to Enable DevOps , 2015, SOEN.

[18]  A. Greenberg,et al.  Towards highly reliable enterprise network services via inference of multi-level dependencies , 2007, SIGCOMM '07.

[19]  Richard Mortier,et al.  Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.

[20]  Peng Huang,et al.  13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8-10, 2018 , 2018, OSDI.

[21]  Robert B. Ross,et al.  Fail-Slow at Scale , 2018, ACM Trans. Storage.

[22]  H. Abdi Discriminant Correspondence Analysis , 2006 .

[23]  Sudheendra Hangal,et al.  Tracking down software bugs using automatic anomaly detection , 2002, ICSE '02.

[24]  Paramvir Bahl,et al.  Detailed diagnosis in enterprise networks , 2009, SIGCOMM '09.

[25]  Michael Hamburg,et al.  Meltdown: Reading Kernel Memory from User Space , 2018, USENIX Security Symposium.

[26]  Qiang Fu,et al.  Performance Issue Diagnosis for Online Service Systems , 2012, 2012 IEEE 31st Symposium on Reliable Distributed Systems.

[27]  Vanish Talwar,et al.  Online detection of utility cloud anomalies using metric distributions , 2010, 2010 IEEE Network Operations and Management Symposium - NOMS 2010.

[28]  Michael Hamburg,et al.  Spectre Attacks: Exploiting Speculative Execution , 2018, 2019 IEEE Symposium on Security and Privacy (SP).

[29]  Ramayya Krishnan,et al.  Incremental hierarchical clustering of text documents , 2006, CIKM '06.

[30]  David A. Patterson,et al.  Path-Based Failure and Evolution Management , 2004, NSDI.

[31]  Michael J. Freedman,et al.  Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area , 2014, NSDI.

[32]  Shenglin Zhang,et al.  FUNNEL: Assessing Software Changes in Web-Based Services , 2018, IEEE Transactions on Services Computing.

[33]  Chris Chatfield,et al.  The Holt-Winters Forecasting Procedure , 1978 .