ADARES: Adaptive Resource Management for Virtual Machines

Virtual execution environments allow for consolidation of multiple applications onto the same physical server, thereby enabling more efficient use of server resources. However, users often statically configure the resources of virtual machines through guesswork, resulting in either insufficient resource allocations that hinder VM performance, or excessive allocations that waste precious data center resources. In this paper, we first characterize real-world resource allocation and utilization of VMs through the analysis of an extensive dataset, consisting of more than 250k VMs from over 3.6k private enterprise clusters. Our large-scale analysis confirms that VMs are often misconfigured, either overprovisioned or underprovisioned, and that this problem is pervasive across a wide range of private clusters. We then propose ADARES, an adaptive system that dynamically adjusts VM resources using machine learning techniques. In particular, ADARES leverages the contextual bandits framework to effectively manage the adaptations. Our system exploits easily collectible data, at the cluster, node, and VM levels, to make more sensible allocation decisions, and uses transfer learning to safely explore the configurations space and speed up training. Our empirical evaluation shows that ADARES can significantly improve system utilization without sacrificing performance. For instance, when compared to threshold and prediction-based baselines, it achieves more predictable VM-level performance and also reduces the amount of virtual CPUs and memory provisioned by up to 35% and 60% respectively for synthetic workloads on real clusters.

[1]  Richard S. Sutton,et al.  Temporal credit assignment in reinforcement learning , 1984 .

[2]  Prashant J. Shenoy,et al.  Resource overbooking and application profiling in shared hosting platforms , 2002, OSDI '02.

[3]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[4]  Jeffrey S. Chase,et al.  Active and accelerated learning of cost models for optimizing scientific applications , 2006, VLDB.

[5]  Kang G. Shin,et al.  Adaptive control of virtualized resources in utility computing environments , 2007, EuroSys '07.

[6]  John Langford,et al.  The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information , 2007, NIPS.

[7]  Jerome A. Rolia,et al.  Workload Analysis and Demand Prediction of Enterprise Data Center Applications , 2007, 2007 IEEE 10th International Symposium on Workload Characterization.

[8]  Andrzej Kochut,et al.  Dynamic Placement of Virtual Machines for Managing SLA Violations , 2007, 2007 10th IFIP/IEEE International Symposium on Integrated Network Management.

[9]  Christopher Stewart,et al.  A Dollar from 15 Cents: Cross-Platform Management for Internet Services , 2008, USENIX Annual Technical Conference.

[10]  Xiaoyun Zhu,et al.  1000 islands: an integrated approach to resource management for virtualized data centers , 2009, Cluster Computing.

[11]  Ashraf Aboulnaga,et al.  Automatic virtual machine configuration for database workloads , 2008, SIGMOD Conference.

[12]  Xiaoyun Zhu,et al.  1000 Islands: Integrated Capacity and Workload Management for the Next Generation Data Center , 2008, 2008 International Conference on Autonomic Computing.

[13]  Le Yi Wang,et al.  VCONF: a reinforcement learning approach to virtual machines auto-configuration , 2009, ICAC '09.

[14]  Michael I. Jordan,et al.  Automatic exploration of datacenter performance regimes , 2009, ACDC '09.

[15]  Steven Hand,et al.  Self-adaptive and self-configured CPU resource provisioning for virtualized servers using Kalman filters , 2009, ICAC '09.

[16]  Jose Renato Santos,et al.  JustRunIt: Experiment-Based Management of Virtualized Data Centers , 2009, USENIX Annual Technical Conference.

[17]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[18]  Anand Sivasubramaniam,et al.  Statistical profiling-based techniques for effective power provisioning in data centers , 2009, EuroSys '09.

[19]  Archana Ganapathi,et al.  Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[20]  Chita R. Das,et al.  Towards characterizing cloud backend workloads: insights from Google compute clusters , 2010, PERV.

[21]  C. Spearman The proof and measurement of association between two things. , 2015, International journal of epidemiology.

[22]  Zhenhuan Gong,et al.  PRESS: PRedictive Elastic ReSource Scaling for cloud systems , 2010, 2010 International Conference on Network and Service Management.

[23]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[24]  Isis Truck,et al.  From Data Center Resource Allocation to Control Theory and Back , 2010, 2010 IEEE 3rd International Conference on Cloud Computing.

[25]  Wei Chu,et al.  A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.

[26]  Prashant J. Shenoy,et al.  Empirical evaluation of latency-sensitive application performance in the cloud , 2010, MMSys '10.

[27]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[28]  John Langford,et al.  Contextual Bandit Algorithms with Supervised Learning Guarantees , 2010, AISTATS.

[29]  Alexandru Iosup,et al.  On the Performance Variability of Production Cloud Services , 2011, 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[30]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[31]  Isis Truck,et al.  Using Reinforcement Learning for Autonomic Resource Allocation in Clouds: towards a fully automated workflow , 2011 .

[32]  Srikanth Kandula,et al.  Jockey: guaranteed job latency in data parallel clusters , 2012, EuroSys '12.

[33]  Ricardo Bianchini,et al.  DejaVu: accelerating resource allocation in virtualized environments , 2012, ASPLOS XVII.

[34]  David A. Maltz,et al.  Surviving failures in bandwidth-constrained datacenters , 2012, CCRV.

[35]  Randy H. Katz,et al.  Heterogeneity and dynamicity of clouds at scale: Google trace analysis , 2012, SoCC '12.

[36]  Benjamin Farley,et al.  More for your money: exploiting performance heterogeneity in public clouds , 2012, SoCC '12.

[37]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[38]  Christina Delimitrou,et al.  Paragon: QoS-aware scheduling for heterogeneous datacenters , 2013, ASPLOS '13.

[39]  Cheng-Zhong Xu,et al.  Coordinated Self-Configuration of Virtual Machines and Appliances Using a Model-Free Learning Approach , 2013, IEEE Transactions on Parallel and Distributed Systems.

[40]  Lingjia Tang,et al.  Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[41]  Ricardo Bianchini,et al.  DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments , 2013, USENIX Annual Technical Conference.

[42]  Fabien Hermenier,et al.  BtrPlace: A Flexible Consolidation Manager for Highly Available Applications , 2013, IEEE Transactions on Dependable and Secure Computing.

[43]  Long Wang,et al.  PseudoApp: Performance prediction for application migration to cloud , 2013, 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013).

[44]  Christina Delimitrou,et al.  Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.

[45]  Andreas Krause,et al.  Submodular Function Maximization , 2014, Tractability.

[46]  John Langford,et al.  Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits , 2014, ICML.

[47]  Javier García,et al.  A comprehensive survey on safe reinforcement learning , 2015, J. Mach. Learn. Res..

[48]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[49]  Ion Stoica,et al.  Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics , 2016, NSDI.

[50]  Arvind Krishnamurthy,et al.  Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters , 2016, SoCC.

[51]  Christina Delimitrou,et al.  HCloud: Resource-Efficient Provisioning in Shared Cloud Systems , 2016, ASPLOS.

[52]  Minlan Yu,et al.  CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics , 2017, NSDI.

[53]  Claus Pahl,et al.  A Comparison of Reinforcement Learning Techniques for Fuzzy Cloud Auto-Scaling , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[54]  Marc Peter Deisenroth,et al.  Deep Reinforcement Learning: A Brief Survey , 2017, IEEE Signal Processing Magazine.

[55]  Anil A. Bharath,et al.  Deep Reinforcement Learning: A Brief Survey , 2017, IEEE Signal Processing Magazine.

[56]  Ambuj Tewari,et al.  From Ads to Interventions: Contextual Bandits in Mobile Health , 2017, Mobile Health - Sensors, Analytic Methods, and Applications.

[57]  Randy H. Katz,et al.  Selecting the best VM across multiple public clouds: a data-driven performance modeling approach , 2017, SoCC.

[58]  R. Preston McAfee,et al.  Usage Patterns and the Economics of the Public Cloud , 2017, WWW.

[59]  Dileep George,et al.  Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics , 2017, ICML.

[60]  Ricardo Bianchini,et al.  Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms , 2017, SOSP.

[61]  Danny Jones,et al.  VM Live Migration At Scale , 2018, VEE.

[62]  VMware vSphere 5 . 1 vMotion Architecture , Performance , and Best Practices , .