Leveraging Deep Learning to Improve the Performance Predictability of Cloud Microservices

Performance unpredictability is a major roadblock to cloud adoption, with performance, cost, and revenue ramifications. Predictable performance is even more critical as cloud services transition from monolithic designs to microservices. In systems built with microservices, detecting QoS violations after they occur results in long recovery times, as hotspots propagate and amplify across dependent services. We present Seer, an online cloud performance debugging system that leverages deep learning and the massive amount of tracing data that cloud systems collect to learn spatial and temporal patterns that translate to QoS violations. Seer combines lightweight, distributed RPC-level tracing with detailed low-level hardware monitoring to signal an upcoming QoS violation and diagnose the source of unpredictable performance. Once an imminent QoS violation is detected, Seer notifies the cluster manager to take action to avoid performance degradation altogether. We evaluate Seer both in local clusters and in large-scale deployments of end-to-end applications built with microservices with hundreds of users. We show that Seer correctly anticipates QoS violations 91% of the time and avoids the QoS violation altogether in 84% of cases. Finally, we show that Seer can identify application-level design bugs and provide insights on how to better architect microservices to achieve predictable performance.
