George: Learning to Place Long-Lived Containers in Large Clusters with Operation Constraints

Online cloud services are widely deployed as Long-Running Applications (LRAs) hosted in containers. Placing LRA containers is particularly challenging due to the complex interference between co-located containers and the operation constraints in production clusters, such as fault tolerance, disaster avoidance, and incremental deployment. Existing schedulers typically provide APIs for operators to manually specify container scheduling requirements and offer only qualitative guidelines for container placement. Such schedulers perform poorly in both placement quality and scale, while also requiring manual intervention. In this work, we propose George, an end-to-end, general-purpose LRA scheduler that leverages state-of-the-art Reinforcement Learning (RL) techniques to intelligently schedule LRA containers. We present, for the first time, an optimal container placement formulation whose objective is to maximize container placement performance subject to a set of operation constraints. One fundamental challenge is to categorically satisfy the different operation constraints that arise in practice: specifically, to guarantee hard constraints and to keep soft-constraint violations within a pre-defined threshold. We design a novel projection-based proximal policy optimization (PPPO) algorithm, combined with an integer linear programming technique, to intelligently schedule LRA containers under operation constraints. To reduce training time, we apply transfer learning, taking advantage of the similarity between different LRA scheduling events. We prove theoretically that the proposed algorithm is effective, stable, and safe. We implement George as a plug-in service in Docker Swarm. Evaluation on our in-house cluster demonstrates that George maximizes LRA performance while enforcing the hard constraints and keeping soft-constraint violations within a pre-defined threshold.
The experiments show that George improves LRA performance and scales well: it requires less than one hour of scheduling time in a large cluster with 2K containers and 700 machines, 16x faster than existing schedulers. Compared with state-of-the-art alternatives, George also achieves 26% higher container performance with up to 70% fewer constraint violations.
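To make the projection idea concrete, the following is a minimal, hypothetical sketch — not George's actual PPPO algorithm or the paper's integer linear program. It illustrates the core mechanic described above: a learned policy proposes a per-machine placement score for each container, and the proposal is projected onto the set of placements that satisfy a hard constraint (here, a simple per-machine capacity limit). All names and the greedy projection rule are assumptions for illustration only.

```python
# Hedged sketch of projecting a policy's placement proposal onto a
# hard-constraint feasible set. The real system would solve an ILP;
# here a greedy rule stands in for the projection step.

def project_placement(scores, capacity, load):
    """Return the highest-scoring machine that still has room,
    i.e. the feasible machine closest to the policy's proposal."""
    ranked = sorted(range(len(scores)), key=lambda m: -scores[m])
    for m in ranked:
        if load[m] < capacity[m]:  # hard constraint: per-machine capacity
            return m
    raise RuntimeError("no feasible machine for this container")

def place_all(score_table, capacity):
    """Place containers one by one, projecting each proposal
    onto the remaining feasible machines."""
    load = [0] * len(capacity)
    plan = []
    for scores in score_table:  # one score vector per container
        m = project_placement(scores, capacity, load)
        load[m] += 1
        plan.append(m)
    return plan
```

For example, with three containers that all prefer machine 0 but a capacity of two containers per machine, the third container is projected onto machine 1: `place_all([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]], capacity=[2, 2])` returns `[0, 0, 1]`. The hard constraint is never violated, while the policy's preference is followed whenever feasible.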
