Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure

Modern cloud systems have a vast number of components that continuously undergo changes. Deploying these frequent updates quickly without breaking the system is challenging. In this paper, we present Gandalf, an end-to-end analytics service for safe deployment in a large-scale system infrastructure. Gandalf enables rapid and robust impact assessment of software rollouts to catch bad rollouts before they cause widespread outages. Gandalf monitors and analyzes various fault signals and correlates each signal against all the ongoing rollouts using a spatial and temporal correlation algorithm. Its core decision logic includes an ensemble ranking algorithm that determines which rollout caused the fault signals and a binary classifier that assesses the impact of the fault signals. The analysis result determines whether a rollout is safe to proceed or should be stopped. By using a lambda architecture, Gandalf provides both real-time and long-term deployment monitoring with automated decisions and notifications. Gandalf has been running in production in Microsoft Azure for more than 18 months, serving both data-plane and control-plane components. It achieves 92.4% precision and 100% recall (no high-impact service outages in Azure Compute were caused by bad rollouts) for data-plane rollouts. For control-plane rollouts, Gandalf achieves 94.87% precision and 99.84% recall.
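To make the correlation step concrete, here is a minimal sketch of how fault signals might be matched to ongoing rollouts by location and time. It is not Gandalf's actual algorithm: the abstract does not give those details, so the names `Rollout`, `Fault`, `rank_rollouts`, and `should_stop`, the one-hour window, and the vote threshold are all hypothetical. Plain vote counting stands in for the ensemble ranking, and a fixed threshold stands in for the binary classifier.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    name: str           # hypothetical rollout identifier
    node: str           # node the rollout was deployed to
    deployed_at: float  # deployment time, in seconds


@dataclass(frozen=True)
class Fault:
    signature: str      # hypothetical fault signature
    node: str           # node that reported the fault
    observed_at: float  # observation time, in seconds


def rank_rollouts(faults, rollouts, window=3600.0):
    """Rank rollout names by how many faults they plausibly explain.

    A fault votes for a rollout when it occurred on the same node
    (spatial match) and within `window` seconds after that node was
    deployed (temporal match). Both rules are assumptions.
    """
    votes = Counter()
    for fault in faults:
        for rollout in rollouts:
            delay = fault.observed_at - rollout.deployed_at
            if rollout.node == fault.node and 0 <= delay <= window:
                votes[rollout.name] += 1
    return votes.most_common()


def should_stop(ranked, threshold=2):
    """Stand-in for the impact classifier: stop on enough votes."""
    return bool(ranked) and ranked[0][1] >= threshold


rollouts = [
    Rollout("agent-v2", "n1", 0.0),
    Rollout("agent-v2", "n2", 100.0),
    Rollout("os-patch", "n3", 50.0),
]
faults = [
    Fault("VMStartFailure", "n1", 600.0),
    Fault("VMStartFailure", "n2", 900.0),
    Fault("DiskTimeout", "n3", 9000.0),  # outside the window, no vote
]

ranked = rank_rollouts(faults, rollouts)
print(ranked, should_stop(ranked))
```

In this toy input, both `VMStartFailure` faults fall on nodes that recently received `agent-v2`, so that rollout is ranked first and flagged, while the late `DiskTimeout` fault is not attributed to any rollout.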
