Learned Autoscaling for Cloud Microservices with Multi-Armed Bandits

As cloud applications shift from monolithic architectures to loosely coupled microservices, several challenges in resource management arise. Application developers must determine the compute capacity needed for each microservice in an application. This allocation dictates both the cost and performance of the application and typically relies on machine utilization metrics (e.g., CPU, RAM). While utilization-based policies are often simple to configure, easily understood, and require no training or retraining cost, such policies offer no guarantees or expectations of end-user latency. We design, implement, and evaluate a microservice autoscaling system, COLA, which efficiently learns to manage cluster resources based on user-provided end-to-end latency targets and cost objectives rather than optimizing utilization metrics. COLA relies on training a contextual multi-armed bandit on representative workloads for an application and uses techniques to generalize performance to unseen workloads. We evaluate workloads of varying complexity, including those with a fixed rate, a diurnal pattern, and a dynamic request distribution. Across a set of five open-source microservice applications, we compare COLA against a variety of utilization and machine learning baselines. We find COLA provides the most cost-effective autoscaling solution for a desired median or tail latency target on 13 of 18 workloads. On average, clusters managed by COLA cost 25.1% less than the next closest alternative that meets a specified target latency. We discuss several optimizations, inspired by systems and machine learning literature, that we make during training to efficiently explore the space of possible microservice configurations. These optimizations enable us to train our models over the course of a few hours. The cost savings from managing a cluster with COLA result in the system paying for its own training cost within a few days.
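To make the idea of a latency-targeted contextual bandit concrete, the following is a minimal sketch, not COLA's actual algorithm. It assumes the context is a coarse request-rate bucket, the arms are candidate replica counts for a single service, and the learner is standard UCB1 run independently per context. The latency model, the capacity constant, the reward shaping, and all names (`ContextualUCB1`, `simulated_latency_ms`, `reward`, `train`) are hypothetical and chosen only for illustration.

```python
import math
import random


class ContextualUCB1:
    """One UCB1 bandit per context bucket; arms are candidate replica counts."""

    def __init__(self, contexts, arms):
        self.arms = list(arms)
        self.counts = {c: [0] * len(self.arms) for c in contexts}
        self.means = {c: [0.0] * len(self.arms) for c in contexts}

    def select(self, context):
        counts, means = self.counts[context], self.means[context]
        # Play every arm once before applying the confidence bound.
        for i, n in enumerate(counts):
            if n == 0:
                return i
        total = sum(counts)
        ucb = [m + math.sqrt(2.0 * math.log(total) / n)
               for m, n in zip(means, counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, context, arm_index, value):
        # Incremental update of the running mean reward for this arm.
        self.counts[context][arm_index] += 1
        n = self.counts[context][arm_index]
        self.means[context][arm_index] += (value - self.means[context][arm_index]) / n

    def best_arm(self, context):
        means = self.means[context]
        return self.arms[max(range(len(means)), key=means.__getitem__)]


def simulated_latency_ms(rate, replicas, rng):
    """Hypothetical latency model: latency rises sharply as load nears capacity."""
    capacity_per_replica = 100.0  # requests/s per replica (assumed)
    load = min(rate / (replicas * capacity_per_replica), 0.99)
    return 20.0 / (1.0 - load) + rng.uniform(0.0, 5.0)


def reward(latency_ms, replicas, target_ms=100.0, max_replicas=8):
    """Reward in [0, 1]: zero if the latency target is missed, else higher for fewer replicas."""
    if latency_ms > target_ms:
        return 0.0
    return 1.0 - (replicas - 1) / max_replicas


def train(rounds=2000, seed=0):
    rng = random.Random(seed)
    contexts = [200, 450]  # request-rate buckets in requests/s (assumed)
    bandit = ContextualUCB1(contexts, arms=range(1, 9))
    for _ in range(rounds):
        rate = rng.choice(contexts)
        i = bandit.select(rate)
        replicas = bandit.arms[i]
        latency = simulated_latency_ms(rate, replicas, rng)
        bandit.update(rate, i, reward(latency, replicas))
    return bandit


if __name__ == "__main__":
    trained = train()
    for rate in (200, 450):
        print(rate, "req/s ->", trained.best_arm(rate), "replicas")
```

Under this toy latency model the learner settles on the smallest replica count that meets the 100 ms target in each rate bucket. A real system would measure latency on a live cluster instead of calling a simulator and would need some way to handle request rates between the trained buckets.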
