Autopilot: workload autoscaling at Google

In many public and private cloud systems, users must specify a limit on the resources (CPU cores and RAM) to provision for their workloads. A job that exceeds its limits may be throttled or killed, which delays or drops end-user requests, so human operators naturally err on the side of caution and request a larger limit than the job needs. At scale, this results in massive aggregate resource waste. To address this, Google uses Autopilot to configure resources automatically, adjusting both the number of concurrent tasks in a job (horizontal scaling) and the CPU/memory limits for individual tasks (vertical scaling). Autopilot walks the same fine line as human operators: its primary goal is to reduce slack (the difference between the limit and the actual resource usage) while minimizing the risk that a task is killed with an out-of-memory (OOM) error or that its performance is degraded by CPU throttling. To walk this line, Autopilot applies machine learning algorithms to historical data about prior executions of a job, together with a set of finely tuned heuristics. In practice, Autopiloted jobs have a slack of just 23%, compared with 46% for manually managed jobs. Additionally, Autopilot reduces the number of jobs severely impacted by OOMs by a factor of 10. Despite these advantages, getting Autopilot widely adopted took significant effort, including making potential recommendations easily visible to customers who had yet to opt in, automatically migrating certain categories of jobs, and adding support for custom recommenders. At the time of writing, Autopiloted jobs account for over 48% of Google's fleet-wide resource usage.
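As a rough illustration of the slack metric defined above, the following Python sketch computes slack as a fraction of the limit. The normalization by the limit is an assumption made here so that the result can be read as a percentage like the 23% and 46% figures; the function names and the sample numbers are hypothetical and are not taken from Autopilot itself.

```python
from typing import Iterable, Tuple


def relative_slack(limit: float, usage: float) -> float:
    """Return the unused fraction of a limit: (limit - usage) / limit.

    Assumption: slack is normalized by the limit so it reads as a percentage.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (limit - usage) / limit


def aggregate_slack(tasks: Iterable[Tuple[float, float]]) -> float:
    """Return the relative slack over many (limit, usage) pairs.

    Limits and usages are summed first, so large tasks weigh more than
    small ones (a design choice of this sketch, not of the source).
    """
    pairs = list(tasks)
    total_limit = sum(limit for limit, _ in pairs)
    total_usage = sum(usage for _, usage in pairs)
    return relative_slack(total_limit, total_usage)


if __name__ == "__main__":
    # Hypothetical tasks: (limit, usage) in CPU cores.
    sample = [(4.0, 3.0), (4.0, 1.0)]
    print(f"aggregate slack: {aggregate_slack(sample):.0%}")
```

Lowering a limit toward the observed usage shrinks this value, but it also shrinks the headroom that protects a task from OOM kills or CPU throttling, which is the trade-off the abstract describes.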
